Power laws, Pareto distributions and Zipf's law 
When the probability of measuring a particular value of some quantity varies inversely as a power 
of that value, the quantity is said to follow a power law, also known variously as Zipf's law or the 
Pareto distribution. Power laws appear widely in physics, biology, earth and planetary sciences, 
economics and finance, computer science, demography and the social sciences. For instance, 
the distributions of the sizes of cities, earthquakes, solar flares, moon craters, wars and people's 
personal fortunes all appear to follow power laws. The origin of power-law behaviour has been 
a topic of debate in the scientific community for more than a century. Here we review some of 
the empirical evidence for the existence of power-law forms and the theories proposed to explain 
them. 
I. INTRODUCTION 

Many of the things that scientists measure have a typ- 
ical size or "scale" — a typical value around which in- 
dividual measurements are centred. A simple example 
would be the heights of human beings. Most adult hu- 
man beings are about 180cm tall. There is some varia- 
tion around this figure, notably depending on sex, but we 
never see people who are 10cm tall, or 500cm. To make 
this observation more quantitative, one can plot a his- 
togram of people's heights, as I have done in Fig. 1. The 
figure shows the heights in centimetres of adult men in 
the United States measured between 1959 and 1962, and 
indeed the distribution is relatively narrow and peaked 
around 180cm. Another telling observation is the ratio of 
the heights of the tallest and shortest people. The Guin- 
ness Book of Records claims the world's tallest and short- 
est adult men (both now dead) as having had heights 
272cm and 57cm respectively, making the ratio 4.8. This 
is a relatively low value; as we will see in a moment, 
some other quantities have much higher ratios of largest 
to smallest. 

Figure 1 shows another example of a quantity with 
a typical scale: the speeds in miles per hour of cars on 
the motorway. Again the histogram of speeds is strongly 
peaked, in this case around 75mph. 

But not all things we measure are peaked around a typ- 
ical value. Some vary over an enormous dynamic range, 
sometimes many orders of magnitude. A classic example 
of this type of behaviour is the sizes of towns and cities. 
The largest population of any city in the US is 8.00 mil- 
lion for New York City, as of the most recent (2000) cen- 
sus. The town with the smallest population is harder to 
pin down, since it depends on what you call a town. The 
author recalls in 1993 passing through the town of Mil- 
liken, Oregon, population 4, which consisted of one large 
house occupied by the town's entire human population, 
a wooden shack occupied by an extraordinary number 
of cats and a very impressive flea market. According to 
the Guinness Book, however, America's smallest town is 
Duffield, Virginia, with a population of 52. Whichever 
way you look at it, the ratio of largest to smallest 
population is at least 150 000. Clearly this is quite different 
from what we saw for heights of people. And an even 
more startling pattern is revealed when we look at the 
histogram of the sizes of cities, which is shown in Fig. 2. 

In the left panel of the figure, I show a simple his- 
togram of the distribution of US city sizes. The his- 
togram is highly right-skewed, meaning that while the 
bulk of the distribution occurs for fairly small sizes — 
most US cities have small populations — there is a small 
number of cities with population much higher than the 
typical value, producing the long tail to the right of the 
histogram. This right-skewed form is qualitatively quite 
different from the histograms of people's heights, but is 
not itself very surprising. Given that we know there is a 
large dynamic range from the smallest to the largest city 
sizes, we can immediately deduce that there can only 
be a small number of very large cities. After all, in a 
country such as America with a total population of 300 
million people, you could at most have about 40 cities the 
size of New York. And the 2700 cities in the histogram 
of Fig. 2 cannot have a mean population of more than 
3 × 10^8 / 2700 = 110 000. 

What is surprising on the other hand, is the right panel 
of Fig. 2, which shows the histogram of city sizes again, 
but this time replotted with logarithmic horizontal and 
vertical axes. Now a remarkable pattern emerges: the 
histogram, when plotted in this fashion, follows quite 
closely a straight line. This observation seems first to 
have been made by Auerbach [1], although it is often 
attributed to Zipf [2]. What does it mean? Let p(x) dx 
be the fraction of cities with population between x and 
x + dx. If the histogram is a straight line on log-log 
scales, then ln p(x) = -α ln x + c, where α and c are 
constants. (The minus sign is optional, but convenient since 
the slope of the line in Fig. 2 is clearly negative.) Taking 
the exponential of both sides, this is equivalent to 

    p(x) = C x^{-\alpha},    (1)

with C = e^c. 

Distributions of the form (1) are said to follow a power 
law. The constant α is called the exponent of the power 
law. (The constant C is mostly uninteresting; once α 
is fixed, it is determined by the requirement that the 
distribution p(x) sum to 1; see Section III.A.) 

FIG. 1 Left: histogram of heights in centimetres of American males. Data from the National Health Examination Survey, 
1959-1962 (US Department of Health and Human Services). Right: histogram of speeds in miles per hour of cars on UK 
motorways. Data from Transport Statistics 2003 (UK Department for Transport). 

FIG. 2 Left: histogram of the populations of all US cities with population of 10 000 or more. Right: another histogram of the 
same data, but plotted on logarithmic scales. The approximate straight-line form of the histogram in the right panel implies 
that the distribution follows a power law. Data from the 2000 US Census. (Horizontal axis in both panels: population of city.) 

Power-law distributions occur in an extraordinarily 
diverse range of phenomena. In addition to city 
populations, the sizes of earthquakes [3], moon craters [4], solar 
flares [5], computer files [6] and wars [7], the frequency of 
use of words in any human language [2, 8], the frequency 
of occurrence of personal names in most cultures [9], the 
numbers of papers scientists write [10], the number of 
citations received by papers [11], the number of hits on 
web pages [12], the sales of books, music recordings and 
almost every other branded commodity [13, 14], the 
numbers of species in biological taxa [15], people's annual 
incomes [16] and a host of other variables all follow 
power-law distributions.^1 

Power-law distributions are the subject of this 
article. In the following sections, I discuss ways of detecting 
power-law behaviour, give empirical evidence for power 
laws in a variety of systems and describe some of the 
mechanisms by which power-law behaviour can arise. 

Readers interested in pursuing the subject further may 
also wish to consult the reviews by Sornette [18] and 
Mitzenmacher [19], as well as the bibliography by Li.^2 

^1 Power laws also occur in many situations other than the 
statistical distributions of quantities. For instance, Newton's famous 
law for gravity has a power-law form with exponent α = 2. 
While such laws are certainly interesting in their own way, they 
are not the topic of this paper. Thus, for instance, there has 
in recent years been some discussion of the "allometric" 
scaling laws seen in the physiognomy and physiology of biological 
organisms [17], but since these are not statistical distributions 
they will not be discussed here. 

^2 http://linkage.rockefeller.edu/wli/zipf/. 
FIG. 3 (a) Histogram of the set of 1 million random numbers described in the text, which have a power-law distribution with 
exponent α = 2.5. (b) The same histogram on logarithmic scales. Notice how noisy the results get in the tail towards the 
right-hand side of the panel. This happens because the number of samples in the bins becomes small and statistical fluctuations 
are therefore large as a fraction of sample number. (c) A histogram constructed using "logarithmic binning". (d) A cumulative 
histogram or rank/frequency plot of the same data. The cumulative distribution also follows a power law, but with an exponent 
of α - 1 = 1.5. 



II. MEASURING POWER LAWS 

Identifying power-law behaviour in either natural or 
man-made systems can be tricky. The standard strategy 
makes use of a result we have already seen: a histogram 
of a quantity with a power-law distribution appears as 
a straight line when plotted on logarithmic scales. Just 
making a simple histogram, however, and plotting it on 
log scales to see if it looks straight is, in most cases, a 
poor way to proceed. 

Consider Fig. 3. This example shows a fake data set: 
I have generated a million random real numbers drawn 
from a power-law probability distribution p(x) = C x^{-α} 
with exponent α = 2.5, just for illustrative purposes.^3 
Panel (a) of the figure shows a normal histogram of the 
numbers, produced by binning them into bins of equal 
size 0.1. That is, the first bin goes from 1 to 1.1, the 
second from 1.1 to 1.2, and so forth. On the linear scales 
used this produces a nice smooth curve. 

^3 This can be done using the so-called transformation method. If 
we can generate a random real number r uniformly distributed in 
the range 0 ≤ r < 1, then x = x_min (1 - r)^{-1/(α-1)} is a random 
power-law-distributed real number in the range x_min ≤ x < ∞ 
with exponent α. Note that there has to be a lower limit x_min 
on the range; the power-law distribution diverges as x → 0; see 
Section III.A. 
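
The transformation method described in the footnote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name sample_power_law, the seed and the sample size are hypothetical choices, and the final check uses the fact that, for this distribution, the fraction of samples at or above x is (x / x_min)^{-(α-1)}.

```python
import random


def sample_power_law(n, alpha, xmin=1.0, seed=None):
    """Draw n samples from p(x) ~ x**(-alpha) on x >= xmin (needs alpha > 1).

    Transformation method: x = xmin * (1 - r)**(-1 / (alpha - 1)),
    with r uniform on [0, 1).
    """
    if alpha <= 1.0:
        raise ValueError("alpha must exceed 1 for a normalizable distribution")
    rng = random.Random(seed)
    exponent = -1.0 / (alpha - 1.0)
    # rng.random() lies in [0, 1), so 1 - r lies in (0, 1] and is never zero.
    return [xmin * (1.0 - rng.random()) ** exponent for _ in range(n)]


samples = sample_power_law(100_000, alpha=2.5, xmin=1.0, seed=1)

# About 2**(-1.5), i.e. roughly 35%, of the samples should be at least 2.
fraction_ge_2 = sum(x >= 2.0 for x in samples) / len(samples)
print(min(samples) >= 1.0, round(fraction_ge_2, 3))
```

Using 1 - r rather than r keeps the argument of the power strictly positive, since Python's random() can return 0 but never 1.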

To reveal the power-law form of the distribution it is 
better, as we have seen, to plot the histogram on logarith- 
mic scales, and when we do this for the current data we 
see the characteristic straight-line form of the power-law 
distribution, Fig. 3b. However, the plot is in some 
respects not a very good one. In particular the right-hand 
end of the distribution is noisy because of sampling 
errors. The power-law distribution dwindles in this region, 
meaning that each bin only has a few samples in it, if 
any. So the fractional fluctuations in the bin counts are 
large and this appears as a noisy curve on the plot. One 
way to deal with this would be simply to throw out the 
data in the tail of the curve. But there is often useful 
information in those data and furthermore, as we will see 
in Section II.A, many distributions follow a power law 
only in the tail, so we are in danger of throwing out the 
baby with the bathwater. 

An alternative solution is to vary the width of the bins 
in the histogram. If we are going to do this, we must 
also normalize the sample counts by the width of the 
bins they fall in. That is, the number of samples in a bin 
of width Δx should be divided by Δx to get a count per 
unit interval of x. Then the normalized sample count 
becomes independent of bin width on average and we are 
free to vary the bin widths as we like. The most common 
choice is to create bins such that each is a fixed multiple 
wider than the one before it. This is known as loga- 
rithmic binning. For the present example, for instance, 
we might choose a multiplier of 2 and create bins that 
span the intervals 1 to 1.1, 1.1 to 1.3, 1.3 to 1.7 and so 
forth (i.e., the sizes of the bins are 0.1, 0.2, 0.4 and so 
forth). This means the bins in the tail of the distribu- 
tion get more samples than they would if bin sizes were 
fixed, and this reduces the statistical errors in the tail. It 
also has the nice side-effect that the bins appear to be of 
constant width when we plot the histogram on log scales. 
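
A minimal sketch of logarithmic binning with width normalization, assuming the same layout as the example above (first bin of width 0.1 starting at 1, each later bin twice as wide). The function name log_binned_density and the synthetic data are illustrative assumptions, not the code used for the figures.

```python
import bisect
import random


def log_binned_density(data, xmin, first_width, ratio=2.0):
    """Histogram data in bins whose widths grow by `ratio` from bin to bin.

    Returns (edges, densities). Each density is the bin count divided by
    (number of samples * bin width), i.e. a count per unit interval of x
    normalized so that the densities estimate p(x).
    """
    edges = [xmin]
    width = first_width
    top = max(data)
    while edges[-1] <= top:      # add bins until the largest sample is covered
        edges.append(edges[-1] + width)
        width *= ratio

    counts = [0] * (len(edges) - 1)
    for x in data:
        if x >= xmin:
            # index k such that edges[k] <= x < edges[k + 1]
            counts[bisect.bisect_right(edges, x) - 1] += 1

    n = sum(counts)
    densities = [c / (n * (edges[k + 1] - edges[k])) for k, c in enumerate(counts)]
    return edges, densities


# Synthetic power-law data with exponent 2.5 and xmin = 1 (transformation method).
rng = random.Random(2)
data = [(1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(100_000)]

edges, densities = log_binned_density(data, xmin=1.0, first_width=0.1, ratio=2.0)
for left, right, density in zip(edges, edges[1:], densities):
    print(f"{left:10.1f} {right:10.1f} {density:.3e}")
```

Dividing by the bin width is what makes the values comparable between bins; without it the wider bins in the tail would appear artificially tall.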

I used logarithmic binning in the construction of 
Fig. 2b, which is why the points representing the 
individual bins appear equally spaced. In Fig. 3c I have done 
the same for our computer-generated power-law data. As 
we can see, the straight-line power-law form of the 
histogram is now much clearer and can be seen to extend for 
at least a decade further than was apparent in Fig. 3b. 

Even with logarithmic binning there is still some noise 
in the tail, although it is sharply decreased. Suppose the 
bottom of the lowest bin is at x_min and the ratio of the 
widths of successive bins is a. Then the kth bin extends 
from x_{k-1} = x_min a^{k-1} to x_k = x_min a^k and the expected 
number of samples falling in this interval is 

    \int_{x_{k-1}}^{x_k} p(x)\,dx = C \int_{x_{k-1}}^{x_k} x^{-\alpha}\,dx
        = C\,\frac{a^{\alpha-1}-1}{\alpha-1}\,(x_{\min} a^k)^{-\alpha+1}.    (2)

Thus, so long as α > 1, the number of samples per bin 
goes down as k increases and the bins in the tail will have 
more statistical noise than those that precede them. As 
we will see in the next section, most power-law 
distributions occurring in nature have 2 < α < 3, so noisy tails 
are the norm. 
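
As a rough worked example of Eq. (2), assume the normalization C = (α - 1) x_min^{α-1}, which follows from requiring p(x) to integrate to 1 over x ≥ x_min (this value of C is an assumption brought in here, not derived in this section). The fraction of samples falling in the kth bin then depends only on a, α and k:

```latex
\begin{align*}
\int_{x_{k-1}}^{x_k} p(x)\,dx
  &= (\alpha-1)\,x_{\min}^{\alpha-1}\,
     \frac{a^{\alpha-1}-1}{\alpha-1}\,\bigl(x_{\min}a^{k}\bigr)^{-\alpha+1}
   = \bigl(a^{\alpha-1}-1\bigr)\,a^{-k(\alpha-1)}, \\[4pt]
\alpha = 2.5,\ a = 2:\qquad
  k = 1 &:\quad 1 - 2^{-1.5} \approx 0.646, \\
  k = 10 &:\quad \bigl(2^{1.5}-1\bigr)\,2^{-15} \approx 5.6\times10^{-5}.
\end{align*}
```

With a million samples this gives roughly 646 000 samples in the first bin but only about 56 in the tenth, so the fractional fluctuation grows from about 0.1% to about 13% even with logarithmic binning.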

Another, and in many ways a superior, method of 
plotting the data is to calculate a cumulative distribution 
function. Instead of plotting a simple histogram of the 
data, we make a plot of the probability P(x) that x has 
a value greater than or equal to x: 

    P(x) = \int_x^\infty p(x')\,dx'.    (3)

The plot we get is no longer a simple representation of 
the distribution of the data, but it is useful nonetheless. 
If the distribution follows a power law p(x) = C x^{-α}, then 

    P(x) = C \int_x^\infty x'^{-\alpha}\,dx' = \frac{C}{\alpha-1}\,x^{-(\alpha-1)}.    (4)

Thus the cumulative distribution function P(x) also 
follows a power law, but with a different exponent α - 1, 
which is 1 less than the original exponent. Thus, if we 
plot P(x) on logarithmic scales we should again get a 
straight line, but with a shallower slope. 

But notice that there is no need to bin the data at 
all to calculate P(x). By its definition, P(x) is 
well-defined for every value of x and so can be plotted as a 
perfectly normal function without binning. This avoids 
all questions about what sizes the bins should be. It 
also makes much better use of the data: binning of data 
lumps all samples within a given range together into the 
same bin and so throws out any information that was 
contained in the individual values of the samples within 
that range. Cumulative distributions don't throw away 
any information; it's all there in the plot. 

Figure 3d shows our computer-generated power-law 
data as a cumulative distribution, and indeed we again 
see the tell-tale straight-line form of the power law, but 
with a shallower slope than before. Cumulative 
distributions like this are sometimes also called rank/frequency 
plots for reasons explained in Appendix A. Cumulative 
distributions with a power-law form are sometimes 
said to follow Zipf's law or a Pareto distribution, 
after two early researchers who championed their study. 
Since power-law cumulative distributions imply a 
power-law form for p(x), "Zipf's law" and "Pareto 
distribution" are effectively synonymous with "power-law 
distribution". (Zipf's law and the Pareto distribution differ 
from one another in the way the cumulative distribution 
is plotted: Zipf made his plots with x on the horizontal 
axis and P(x) on the vertical one; Pareto did it the 
other way around. This causes much confusion in the 
literature, but the data depicted in the plots are of course 
identical.^4) 

^4 See http://www.hpl.hp.com/research/idl/papers/ranking/ 
for a useful discussion of these and related points. 

We know the value of the exponent α for our 
artificial data set since it was generated deliberately to have 
a particular value, but in practical situations we would 
often like to estimate α from observed data. One way 
to do this would be to fit the slope of the line in plots 
like Figs. 3b, c or d, and this is the most commonly used 
method. Unfortunately, it is known to introduce 
systematic biases into the value of the exponent [20], so it should 
not be relied upon. For example, a least-squares fit of a 
straight line to Fig. 3b gives α = 2.26 ± 0.02, which is 
clearly incompatible with the known value of α = 2.5 
from which the data were generated. 

An alternative, simple and reliable method for 
extracting the exponent is to employ the formula 

    \alpha = 1 + n \left[ \sum_{i=1}^{n} \ln\frac{x_i}{x_{\min}} \right]^{-1}.    (5)

Here the quantities x_i, i = 1 ... n are the measured values 
of x and x_min is again the minimum value of x. (As 
discussed in the following section, in practical situations 
x_min usually corresponds not to the smallest value of x 
measured but to the smallest for which the power-law 
behaviour holds.) An estimate of the expected statistical 
error σ on (5) is given by 

    \sigma = \sqrt{n} \left[ \sum_{i=1}^{n} \ln\frac{x_i}{x_{\min}} \right]^{-1}
           = \frac{\alpha-1}{\sqrt{n}}.    (6)

The derivation of both these formulas is given in 
Appendix B. 

Applying Eqs. (5) and (6) to our present data gives an 
estimate of α = 2.500 ± 0.002 for the exponent, which 
agrees well with the known value of 2.5. 
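
As a minimal sketch of Eqs. (5) and (6), the following Python (standard library only) generates synthetic power-law data with the transformation method and applies the two formulas. The function name fit_power_law, the seed and the sample size of 100 000 (smaller than the million used in the text) are illustrative assumptions.

```python
import math
import random


def fit_power_law(data, xmin):
    """Return (alpha, sigma, n) from Eqs. (5) and (6) for the samples x >= xmin."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    if n == 0 or log_sum == 0.0:
        raise ValueError("need samples above xmin that are not all equal to xmin")
    alpha = 1.0 + n / log_sum             # Eq. (5)
    sigma = (alpha - 1.0) / math.sqrt(n)  # Eq. (6)
    return alpha, sigma, n


# Synthetic data: exponent 2.5, xmin = 1, drawn with the transformation method.
rng = random.Random(3)
true_alpha = 2.5
data = [(1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0)) for _ in range(100_000)]

alpha, sigma, n = fit_power_law(data, xmin=1.0)
print(f"alpha = {alpha:.3f} +/- {sigma:.3f} from n = {n} samples")
```

Since σ scales as 1/√n, ten times more samples (a million, as in the text) would shrink the error by about a factor of three, to roughly 0.0015, in line with the ± 0.002 quoted above.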



A. Examples of power laws 

In Fig. 4 we show cumulative distributions of twelve 
different quantities measured in physical, biological, tech- 
nological and social systems of various kinds. All have 
been proposed to follow power laws over some part of 
their range. The ubiquity of power-law behaviour in the 
natural world has led many scientists to wonder whether 
there is a single, simple, underlying mechanism linking 
all these different systems together. Several 
candidates for such mechanisms have been proposed, going by 
names like "self-organized criticality" and "highly 
optimized tolerance". However, the conventional wisdom is 
that there are actually many different mechanisms for 
producing power laws and that different ones are 
applicable to different cases. We discuss these points further 
in Section IV. 

The distributions shown in Fig. 4 are as follows. 

(a) Word frequency: Estoup [8] observed that the 
frequency with which words are used appears to 
follow a power law, and this observation was famously 
examined in depth and confirmed by Zipf [2]. 
Panel (a) of Fig. 4 shows the cumulative 
distribution of the number of times that words occur in a 
typical piece of English text, in this case the text of 
the novel Moby Dick by Herman Melville.^5 Similar 
distributions are seen for words in other languages. 

(b) Citations of scientific papers: As first observed 
by Price [11], the numbers of citations received by 
scientific papers appear to have a power-law 
distribution. The data in panel (b) are taken from the 
Science Citation Index, as collated by Redner [22], 
and are for papers published in 1981. The plot 
shows the cumulative distribution of the number of 
citations received by a paper between publication 
and June 1997. 

^5 The most common words in this case are, in order, "the", "of", 
"and", "a" and "to", and the same is true for most written 
English texts. Interestingly, however, it is not true for spoken 
English. The most common words in spoken English are, in order, 
"I", "and", "the", "to" and "that". 

(c) Web hits: The cumulative distribution of the 
number of "hits" received by web sites (i.e., servers, 
not pages) during a single day from a subset of the 
users of the AOL Internet service. The site with 
the most hits, by a long way, was yahoo.com. After 
Adamic and Huberman [12]. 

(d) Copies of books sold: The cumulative 
distribution of the total number of copies sold in America 
of the 633 bestselling books that sold 2 million 
or more copies between 1895 and 1965. The data 
were compiled painstakingly over a period of 
several decades by Alice Hackett, an editor at 
Publisher's Weekly [23]. The bestselling book during 
the period covered was Benjamin Spock's The 
Common Sense Book of Baby and Child Care. (The 
Bible, which certainly sold more copies, is not really 
a single book, but exists in many different 
translations, versions and publications, and was excluded 
by Hackett from her statistics.) Substantially 
better data on book sales than Hackett's are now 
available from operations such as Nielsen BookScan, but 
unfortunately at a price this author cannot afford. 
I should be very interested to see a plot of sales 
figures from such a modern source. 

(e) Telephone calls: The cumulative distribution of 
the number of calls received on a single day by 51 
million users of AT&T long distance telephone 
service in the United States. After Aiello et al. [24]. 
The largest number of calls received by a customer 
in that day was 375 746, or about 260 calls a minute 
(obviously to a telephone number that has many 
people manning the phones). Similar distributions 
are seen for the number of calls placed by users and 
also for the numbers of email messages that people 
send and receive [25, 26]. 






(f) Magnitude of earthquakes: The cumulative dis- 
tribution of the Richter (local) magnitude of earth- 
quakes occurring in California between January 
1910 and May 1992, as recorded in the Berkeley 
Earthquake Catalog. The Richter magnitude is de- 
fined as the logarithm, base 10, of the maximum 
amplitude of motion detected in the earthquake, 
and hence the horizontal scale in the plot, which 
is drawn as linear, is in effect a logarithmic scale 
of amplitude. The power law relationship in the 
earthquake distribution is thus a relationship be- 
tween amplitude and frequency of occurrence. The 
data are from the National Geophysical Data Center, 
www.ngdc.noaa.gov. 

(g) Diameter of moon craters: The cumulative 
distribution of the diameter of moon craters. Rather 
than measuring the (integer) number of craters of 
a given size on the whole surface of the moon, the 
vertical axis is normalized to measure number of 
craters per square kilometre, which is why the axis 
goes below 1, unlike the rest of the plots, since it is 
entirely possible for there to be less than one crater 
of a given size per square kilometre. After Neukum 
and Ivanov [4]. 

FIG. 4 Cumulative distributions or "rank/frequency plots" of twelve quantities reputed to follow power laws. The distributions 
were computed as described in Appendix A. Data in the shaded regions were excluded from the calculations of the exponents 
in Table I. Source references for the data are given in the text. (a) Numbers of occurrences of words in the novel Moby Dick 
by Herman Melville. (b) Numbers of citations to scientific papers published in 1981, from time of publication until June 
1997. (c) Numbers of hits on web sites by 60 000 users of the America Online Internet service for the day of 1 December 1997. 
(d) Numbers of copies of bestselling books sold in the US between 1895 and 1965. (e) Number of calls received by AT&T 
telephone customers in the US for a single day. (f) Magnitude of earthquakes in California between January 1910 and May 1992. 
Magnitude is proportional to the logarithm of the maximum amplitude of the earthquake, and hence the distribution obeys a 
power law even though the horizontal axis is linear. (g) Diameter of craters on the moon. Vertical axis is measured per square 
kilometre. (h) Peak gamma-ray intensity of solar flares in counts per second, measured from Earth orbit between February 
1980 and November 1989. (i) Intensity of wars from 1816 to 1980, measured as battle deaths per 10 000 of the population of the 
participating countries. (j) Aggregate net worth in dollars of the richest individuals in the US in October 2003. (k) Frequency 
of occurrence of family names in the US in the year 1990. (l) Populations of US cities in the year 2000. 

(h) Intensity of solar flares: The cumulative dis- 
tribution of the peak gamma-ray intensity of 
solar flares. The observations were made be- 
tween 1980 and 1989 by the instrument known 
as the Hard X-Ray Burst Spectrometer aboard 
the Solar Maximum Mission satellite launched 
in 1980. The spectrometer used a CsI 
scintillation detector to measure gamma-rays from 
solar flares and the horizontal axis in the 
figure is calibrated in terms of scintillation counts 
per second from this detector. The data are 
from the NASA Goddard Space Flight Center, 
umbra.nascom.nasa.gov/smm/hxrbs.html. See 
also Lu and Hamilton [5]. 

(i) Intensity of wars: The cumulative distribution 
of the intensity of 119 wars from 1816 to 1980. In- 
tensity is defined by taking the number of battle 
deaths among all participant countries in a war, 
dividing by the total combined populations of the 
countries and multiplying by 10 000. For instance, 
the intensities of the First and Second World Wars 
were 141.5 and 106.3 battle deaths per 10 000 re- 
spectively. The worst war of the period covered 
was the small but horrifically destructive Paraguay- 
Bolivia war of 1932-1935 with an intensity of 382.4. 
The data are from Small and Singer [27]. See also 
Roberts and Turcotte [7]. 

(j) Wealth of the richest people: The cumulative 
distribution of the total wealth of the richest people 
in the United States. Wealth is defined as aggre- 
gate net worth, i.e., total value in dollars at current 
market prices of all an individual's holdings, minus 
their debts. For instance, when the data were com- 
piled in 2003, America's richest person, William H. 
Gates III, had an aggregate net worth of $46 
billion, much of it in the form of stocks of the company 
he founded, Microsoft Corporation. Note that net 
worth doesn't actually correspond to the amount of 
money individuals could spend if they wanted to: 
if Bill Gates were to sell all his Microsoft stock, for 
instance, or otherwise divest himself of any signif- 
icant portion of it, it would certainly depress the 
stock price. The data are from Forbes magazine, 6 
October 2003. 

(k) Frequencies of family names: Cumulative dis- 
tribution of the frequency of occurrence in the US of 
the 89 000 most common family names, as recorded 
by the US Census Bureau in 1990. Similar 
distributions are observed for names in some other cultures 
as well (for example in Japan [28]) but not in all 
cases. Korean family names for instance appear to 
have an exponential distribution [29]. 

(l) Populations of cities: Cumulative distribution 
of the size of the human populations of US cities as 
recorded by the US Census Bureau in 2000. 

Few real-world distributions follow a power law over 
their entire range, and in particular not for smaller 
values of the variable being measured. As pointed out in 
the previous section, for any positive value of the 
exponent α the function p(x) = C x^{-α} diverges as x → 0. In 
reality therefore, the distribution must deviate from the 
power-law form below some minimum value x_min. In our 
computer-generated example of the last section we 
simply cut off the distribution altogether below x_min so that 
p(x) = 0 in this region, but most real-world examples 
are not that abrupt. Figure 4 shows distributions with 
a variety of behaviours for small values of the variable 
measured; the straight-line power-law form asserts itself 
only for the higher values. Thus one often hears it said 
that the distribution of such-and-such a quantity "has a 
power-law tail". 

Extracting a value for the exponent α from 
distributions like these can be a little tricky, since it requires 
us to make a judgement, sometimes imprecise, about the 
value x_min above which the distribution follows the power 
law. Once this judgement is made, however, α can be 
calculated simply from Eq. (5).^6 (Care must be taken to 
use the correct value of n in the formula; n is the number 
of samples that actually go into the calculation, 
excluding those with values below x_min, not the overall total 
number of samples.) 
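
The effect of the choice of x_min, and of counting only the samples above it, can be illustrated with a small Python sketch. The data set here is entirely hypothetical: a flat body on [1, 10) that is not a power law, joined to a power-law tail with exponent 2.5 starting at x = 10. The function name fit_tail is likewise an illustrative choice.

```python
import math
import random


def fit_tail(data, xmin):
    """Apply Eq. (5) to the samples with x >= xmin only; n counts just those."""
    kept = [x for x in data if x >= xmin]
    n = len(kept)
    alpha = 1.0 + n / sum(math.log(x / xmin) for x in kept)
    sigma = (alpha - 1.0) / math.sqrt(n)   # Eq. (6)
    return alpha, sigma, n


rng = random.Random(4)
# Hypothetical data: uniform body below 10, power-law tail (exponent 2.5) above.
body_part = [1.0 + 9.0 * rng.random() for _ in range(50_000)]
tail_part = [10.0 * (1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(50_000)]
data = body_part + tail_part

for xmin in (1.0, 10.0):
    alpha, sigma, n = fit_tail(data, xmin)
    print(f"xmin = {xmin:4.1f}: alpha = {alpha:.3f} +/- {sigma:.3f}, n = {n}")
```

Fitting from x_min = 1 mixes the non-power-law body into the estimate and gives an exponent well below 2.5, while fitting from x_min = 10 uses only the tail samples and should recover a value close to 2.5.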

Table I lists the estimated exponents for each of the 
distributions of Fig. 4, along with standard errors and 
also the values of x_min used in the calculations. Note 
that the quoted errors correspond only to the statistical 
sampling error in the estimation of α; they include no 
estimate of any errors introduced by the fact that a single 
power-law function may not be a good model for the data 
in some cases or for variation of the estimates with the 
value chosen for x_min. 

In the author's opinion, the identification of some of 
the distributions in Fig.0]as following power laws should 
be considered unconfirmed. While the power law seems 
to be an excellent model for most of the data sets de- 
picted, a tenable case could be made that the distribu- 
tions of web hits and family names might have two differ- 
ent power-law regimes with slightly different exponents.'' 



® Sometimes the tail is also cut off because there is, for one reason 
or another, a limit on the largest value that may occur. An 
example is the finite-size effects found in critical pheno men a — 
see Section ITVn In this case, Eq. jFJ must be modified |20| . 

^ Significantly more tenuous claims to power-law behaviour for 
other quantities have appeared elsewhere in the literature, for 
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quantity 


minimum 


exponent 
a 


(a) 


frequency of use of words 


1 


2.20(1) 


(b) 


number of citations to papers 


100 


3.04(2) 


(c) 


number of hits on web sites 


1 


2.40(1) 


(d) 


copies of books sold in the US 


2 000 000 


3.51(16) 


(e) 


telephone calls received 


10 


2.22(1) 


(f) 


magnitude of earthquakes 


3.8 


3.04(4) 


(g) 


diameter of moon craters 


0.01 


3.14(5) 


(h) 


intensity of solar flares 


200 


1.83(2) 


(i) 


intensity of wars 


3 


1.80(9) 


(j) 


net worth of Americans 


$600m 


2.09(4) 


(k) 


frequency of family names 


10 000 


1.94(1) 


(1) 


population of US cities 


40 000 


2.30(5) 



TABLE I Parameters for the distributions shown in Fig. 0] 
The labels on the left refer to the panels in the flgure. Expo- 
nent values were calculated using the maximum likelihood 
method of Eq. ^ and Appendix ^ except for the moon 
craters (g), for which only cumulative data were available. For 
this case the exponent quoted is from a simple least-squares flt 
and should be treated with caution. Numbers in parentheses 
give the standard error on the trailing figures. 



And the data for the numbers of copies of books sold 
cover rather a small range — little more than one decade 
horizontally. Nonetheless, one can, without stretching 
the interpretation of the data unreasonably, claim that 
power-law distributions have been observed in language, 
demography, commerce, information and computer sci- 
ences, geology, physics and astronomy, and this on its 
own is an extraordinary statement. 
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FIG. 5 Cumulative distributions of some quantities whose 
distributions span several orders of magnitude but that 
nonetheless do not follow power laws, (a) The number of 
sightings of 591 species of birds in the North American Breed- 
ing Bird Survey 2003. (b) The number of addresses in the 
email address books of 16 881 users of a large university com- 
puter system 33]. (c) The size in acres of all wildflres occur- 
ring on US federal land between 1986 and 1996 (National Fire 
Occurrence Database, USDA Forest Service and Department 
of the Interior). Note that the horizontal axis is logarithmic 
in frames (a) and (c) but linear in frame (b). 



B. Distributions that do not follow a power law 

Power-law distributions are, as we have seen, impres- 
sively ubiquitous, but they are not the only form of broad 
distribution. Lest I give the impression that everything 
interesting follows a power law, let me emphasize that 
there are quite a number of quantities with highly right- 
skewed distributions that nonetheless do not obey power 
laws. A few of them, shown in Fig. 5, are the following:

(a) The abundance of North American bird species, 
which spans over five orders of magnitude but is 
probably distributed according to a log-normal. A 
log-normally distributed quantity is one whose logarithm
is normally distributed; see Section IV.G
and Ref. [33] for further discussions.

(Footnote, continued:) instance in the discussion of the
distribution of the sizes of electrical blackouts [30, 31]. These
however I consider insufficiently substantiated for inclusion in
the present work.

(b) The number of entries in people's email address
books, which spans about three orders of magnitude
but seems to follow a stretched exponential.
A stretched exponential is a curve of the form $e^{-ax^b}$
for some constants $a$, $b$.

(c) The distribution of the sizes of forest fires, which 
spans six orders of magnitude and could follow a 
power law but with an exponential cutoff. 

This being an article about power laws, I will not discuss 
further the possible explanations for these distributions, 
but the scientist confronted with a new set of data having 
a broad dynamic range and a highly skewed distribution 
should certainly bear in mind that a power-law model is 
only one of several possibilities for fitting it. 

III. THE MATHEMATICS OF POWER LAWS 

A continuous real variable with a power-law distribution
has a probability $p(x)\,dx$ of taking a value in the
interval from $x$ to $x + dx$, where

$$p(x) = C x^{-\alpha}, \tag{7}$$

with $\alpha > 0$. As we saw in Section II.A, there must be
some lowest value $x_{\min}$ at which the power law is obeyed,
and we consider only the statistics of $x$ above this value.



A. Normalization 

The constant $C$ in Eq. (7) is given by the normalization
requirement that

$$1 = \int_{x_{\min}}^{\infty} p(x)\,dx
    = C \int_{x_{\min}}^{\infty} x^{-\alpha}\,dx
    = \frac{C}{1-\alpha}\Bigl[x^{-\alpha+1}\Bigr]_{x_{\min}}^{\infty}. \tag{8}$$

We see immediately that this only makes sense if $\alpha > 1$,
since otherwise the right-hand side of the equation
would diverge: power laws with exponents less than unity
cannot be normalized and don't normally occur in nature.
If $\alpha > 1$ then Eq. (8) gives

$$C = (\alpha - 1)\,x_{\min}^{\alpha-1}, \tag{9}$$

and the correct normalized expression for the power law
itself is

$$p(x) = \frac{\alpha-1}{x_{\min}} \left(\frac{x}{x_{\min}}\right)^{-\alpha}. \tag{10}$$

Some distributions follow a power law for part of their
range but are cut off at high values of $x$. That is, above
some value they deviate from the power law and fall off
quickly towards zero. If this happens, then the distribution
may be normalizable no matter what the value of
the exponent $\alpha$. Even so, exponents less than unity are
rarely, if ever, seen. 
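
As a rough illustration of the normalized form in Eq. (10), the sketch below evaluates the density, checks numerically that it integrates to one, and draws samples by inverting the cumulative distribution $(x/x_{\min})^{-\alpha+1}$. The function names, the log-spaced midpoint rule and the finite upper limit are choices made for this example, not part of the original text.

```python
import math
import random


def power_law_pdf(x, alpha, x_min):
    """Normalized density of Eq. (10); taken to be zero below x_min."""
    if x < x_min:
        return 0.0
    return (alpha - 1.0) / x_min * (x / x_min) ** (-alpha)


def sample_power_law(alpha, x_min, rng):
    """Inverse-transform sampling: solve (x/x_min)^(1-alpha) = 1 - r for x."""
    r = rng.random()  # uniform on [0, 1)
    return x_min * (1.0 - r) ** (-1.0 / (alpha - 1.0))


def integrate_pdf(alpha, x_min, x_max, steps=20_000):
    """Midpoint rule in ln(x), where the integrand varies smoothly."""
    log_lo, log_hi = math.log(x_min), math.log(x_max)
    h = (log_hi - log_lo) / steps
    total = 0.0
    for i in range(steps):
        x = math.exp(log_lo + (i + 0.5) * h)
        total += power_law_pdf(x, alpha, x_min) * x * h  # dx = x d(ln x)
    return total


if __name__ == "__main__":
    # Should be very close to 1 for alpha > 1 and a large upper limit.
    print(integrate_pdf(2.5, 1.0, 1e6))
    rng = random.Random(0)
    print([round(sample_power_law(2.5, 1.0, rng), 3) for _ in range(5)])
```

Integrating in $\ln x$ rather than $x$ keeps the step count small even though the range covers six decades.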



B. Moments 



The mean value of our power-law distributed quantity $x$
is given by

$$\langle x \rangle = \int_{x_{\min}}^{\infty} x\,p(x)\,dx
   = C \int_{x_{\min}}^{\infty} x^{-\alpha+1}\,dx
   = \frac{C}{2-\alpha}\Bigl[x^{-\alpha+2}\Bigr]_{x_{\min}}^{\infty}. \tag{11}$$

Note that this expression becomes infinite if $\alpha \le 2$.
Power laws with such low values of $\alpha$ have no finite mean.
The distributions of sizes of solar flares and wars in
Table I are examples of such power laws.

What does it mean to say that a distribution has an
infinite mean? Surely we can take the data for real solar
flares and calculate their average? Indeed we can and
necessarily we will always get a finite number from the
calculation, since each individual measurement $x$ is itself
a finite number and there are a finite number of them.
Only if we had a truly infinite number of samples would
we see the mean actually diverge.

However, if we were to repeat our finite experiment
many times and calculate the mean for each repetition,
then the mean of those many means is itself also formally
divergent, since it is simply equal to the mean we
would calculate if all the repetitions were combined into
one large experiment. This implies that, while the mean
may take a relatively small value on any particular
repetition of the experiment, it must occasionally take a huge
value, in order that the overall mean diverge as the number
of repetitions does. Thus there must be very large
fluctuations in the value of the mean, and this is what
the divergence in Eq. (11) really implies. In effect, our
calculations are telling us that the mean is not a well
defined quantity, because it can vary enormously from
one measurement to the next, and indeed can become
arbitrarily large. The formal divergence of $\langle x \rangle$ is a signal
that, while we can quote a figure for the average of the
samples we measure, that figure is not a reliable guide to
the typical size of the samples in another instance of the
same experiment.

For $\alpha > 2$ however, the mean is perfectly well defined,
with a value given by Eq. (11) of

$$\langle x \rangle = \frac{\alpha-1}{\alpha-2}\,x_{\min}. \tag{12}$$

We can also calculate higher moments of the distribution
$p(x)$. For instance, the second moment, the mean
square, is given by

$$\langle x^2 \rangle = \frac{C}{3-\alpha}\Bigl[x^{-\alpha+3}\Bigr]_{x_{\min}}^{\infty}. \tag{13}$$

This diverges if $\alpha \le 3$. Thus power-law distributions in
this range, which includes almost all of those in Table I,
have no meaningful mean square, and thus also no meaningful
variance or standard deviation. If $\alpha > 3$, then the
second moment is finite and well-defined, taking the value

$$\langle x^2 \rangle = \frac{\alpha-1}{\alpha-3}\,x_{\min}^2. \tag{14}$$

These results can easily be extended to show that in
general all moments $\langle x^m \rangle$ exist for $m < \alpha - 1$ and all
higher moments diverge. The ones that do exist are given
by

$$\langle x^m \rangle = \frac{\alpha-1}{\alpha-1-m}\,x_{\min}^m. \tag{15}$$
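
The moment formula and the instability of the sample mean can be sketched in a few lines. The helper names below are hypothetical, and the sampler uses the inverse-transform formula $x = x_{\min}(1-r)^{-1/(\alpha-1)}$, which follows from the normalized density but is an addition for this example. No particular output is claimed for the random part; the point is only that repeated runs with $\alpha \le 2$ tend to give widely different means.

```python
import math
import random


def power_law_moment(m, alpha, x_min):
    """m-th moment of the normalized power law: Eq. (15) if m < alpha - 1, else infinite."""
    if alpha <= 1.0:
        raise ValueError("alpha must exceed 1 for the distribution to be normalizable")
    if m >= alpha - 1.0:
        return math.inf
    return (alpha - 1.0) / (alpha - 1.0 - m) * x_min ** m


def sample_mean(alpha, x_min, n, rng):
    """Mean of n independent draws, generated by inverse-transform sampling."""
    total = 0.0
    for _ in range(n):
        total += x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
    return total / n


if __name__ == "__main__":
    print(power_law_moment(1, 3.0, 1.0))   # finite mean, Eq. (12)
    print(power_law_moment(2, 2.5, 1.0))   # divergent second moment
    rng = random.Random(1)
    # Five repetitions of the same "experiment" with alpha < 2.
    print([sample_mean(1.8, 1.0, 10_000, rng) for _ in range(5)])
```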



C. Largest value 

Suppose we draw n measurements from a power-law 
distribution. What value is the largest of those measure- 
ments likely to take? Or, more precisely, what is the 
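probability $\pi(x)\,dx$ that the largest value falls in the
interval between $x$ and $x + dx$?

The definitive property of the largest value in a sample
is that there are no others larger than it. The probability
that a particular sample will be larger than $x$ is given by
the quantity $P(x)$ defined in Eq. (3):

$$P(x) = \int_x^{\infty} p(x')\,dx'
       = \frac{C}{\alpha-1}\,x^{-\alpha+1}
       = \left(\frac{x}{x_{\min}}\right)^{-\alpha+1}, \tag{16}$$

so long as $\alpha > 1$. And the probability that a sample is
not greater than $x$ is $1 - P(x)$. Thus the probability that
a particular sample we draw, sample $i$, will lie between
$x$ and $x + dx$ and that all the others will be no greater
than it is $p(x)\,dx \times [1 - P(x)]^{n-1}$. Then there are $n$ ways
to choose $i$, giving a total probability

$$\pi(x) = n\,p(x)\,[1 - P(x)]^{n-1}. \tag{17}$$

Now we can calculate the mean value $\langle x_{\max} \rangle$ of the
largest sample thus:

$$\langle x_{\max} \rangle = \int_{x_{\min}}^{\infty} x\,\pi(x)\,dx
   = n \int_{x_{\min}}^{\infty} x\,p(x)\,[1 - P(x)]^{n-1}\,dx. \tag{18}$$

Using Eqs. (10) and (16), this is

$$\begin{aligned}
\langle x_{\max} \rangle
 &= n(\alpha-1) \int_{x_{\min}}^{\infty}
    \left(\frac{x}{x_{\min}}\right)^{-\alpha+1}
    \left[1 - \left(\frac{x}{x_{\min}}\right)^{-\alpha+1}\right]^{n-1} dx \\
 &= n\,x_{\min} \int_0^1 \frac{y^{n-1}}{(1-y)^{1/(\alpha-1)}}\,dy \\
 &= n\,x_{\min}\,\mathrm{B}\bigl(n, (\alpha-2)/(\alpha-1)\bigr),
\end{aligned} \tag{19}$$

where I have made the substitution $y = 1 - (x/x_{\min})^{-\alpha+1}$
and $\mathrm{B}(a, b)$ is Legendre's beta-function, which is defined
by

$$\mathrm{B}(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}, \tag{20}$$

with $\Gamma(a)$ the standard $\Gamma$-function:

$$\Gamma(a) = \int_0^{\infty} t^{a-1} e^{-t}\,dt. \tag{21}$$

(Footnote: Legendre's beta-function is also called the Eulerian
integral of the first kind.)

The beta-function has the interesting property that
for large values of either of its arguments it itself follows
a power law. For instance, for large $a$ and fixed $b$,
$\mathrm{B}(a, b) \sim a^{-b}$. In most cases of interest, the number $n$
of samples from our power-law distribution will be large
(meaning much greater than 1), so

$$\mathrm{B}\bigl(n, (\alpha-2)/(\alpha-1)\bigr) \sim n^{-(\alpha-2)/(\alpha-1)}, \tag{22}$$

and

$$\langle x_{\max} \rangle \sim n^{1/(\alpha-1)}. \tag{23}$$

(Footnote: The power-law behaviour of the beta-function can be
demonstrated by approximating the $\Gamma$-functions of
Eq. (20) using Stirling's formula.)

Thus, as long as $\alpha > 1$, we find that $\langle x_{\max} \rangle$ always
increases as $n$ becomes larger.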
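
A minimal numerical sketch of the largest-value result, assuming the closed form $\langle x_{\max} \rangle = n\,x_{\min}\,\mathrm{B}(n, (\alpha-2)/(\alpha-1))$, which requires $\alpha > 2$ for the beta-function argument to be positive. The function name is hypothetical; the beta-function is evaluated through `math.lgamma` so that large $n$ does not overflow.

```python
import math


def expected_max(n, alpha, x_min):
    """Mean of the largest of n samples: n * x_min * B(n, (alpha-2)/(alpha-1))."""
    if alpha <= 2.0:
        return math.inf  # the integral for the mean largest value diverges
    b = (alpha - 2.0) / (alpha - 1.0)
    log_beta = math.lgamma(n) + math.lgamma(b) - math.lgamma(n + b)
    return n * x_min * math.exp(log_beta)


if __name__ == "__main__":
    alpha = 2.5
    for n in (10**3, 10**4, 10**5, 10**6):
        # The second column should grow roughly as n^(1/(alpha-1)), Eq. (23).
        print(n, expected_max(n, alpha, 1.0), n ** (1.0 / (alpha - 1.0)))
```

For $n = 1$ the expression reduces to the ordinary mean, which gives a quick consistency check against Eq. (12).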



D. Top-heavy distributions and the 80/20 rule



Another interesting question is where the majority of 
the distribution of $x$ lies. For any power law with exponent
$\alpha > 1$, the median is well defined. That is, there is
a point $x_{1/2}$ that divides the distribution in half so that
half the measured values of $x$ lie above $x_{1/2}$ and half lie
below. That point is given by

$$\int_{x_{1/2}}^{\infty} p(x)\,dx = \tfrac12 \int_{x_{\min}}^{\infty} p(x)\,dx, \tag{24}$$

or

$$x_{1/2} = 2^{1/(\alpha-1)}\,x_{\min}. \tag{25}$$

So, for example, if we are considering the distribution
of wealth, there will be some well-defined median wealth
that divides the richer half of the population from the
poorer. But we can also ask how much of the wealth
itself lies in those two halves. Obviously more than half
of the total amount of money belongs to the richer half of
the population. The fraction of the money in the richer
half is given by

$$\frac{\int_{x_{1/2}}^{\infty} x\,p(x)\,dx}{\int_{x_{\min}}^{\infty} x\,p(x)\,dx}
   = \left(\frac{x_{1/2}}{x_{\min}}\right)^{-\alpha+2}
   = 2^{-(\alpha-2)/(\alpha-1)}, \tag{26}$$

provided $\alpha > 2$ so that the integrals converge. Thus,
for instance, if $\alpha = 2.1$ for the wealth distribution, as
indicated in Table I, then a fraction $2^{-0.091} \simeq 94\%$ of the
wealth is in the hands of the richer 50% of the population,
making the distribution quite top-heavy. 

More generally, the fraction of the population whose 
personal wealth exceeds $x$ is given by the quantity $P(x)$,
Eq. (16), and the fraction of the total wealth in the hands
of those people is

$$W(x) = \frac{\int_x^{\infty} x'\,p(x')\,dx'}{\int_{x_{\min}}^{\infty} x'\,p(x')\,dx'}
       = \left(\frac{x}{x_{\min}}\right)^{-\alpha+2}, \tag{27}$$

assuming again that $\alpha > 2$. Eliminating $x/x_{\min}$ between
Eqs. (16) and (27), we find that the fraction $W$ of the
wealth in the hands of the richest $P$ of the population is

$$W = P^{(\alpha-2)/(\alpha-1)}, \tag{28}$$

of which Eq. (26) is a special case. This again has a
power-law form, but with a positive exponent now. In
Fig. 6 I show the form of the curve of $W$ against $P$ for
various values of $\alpha$. For all values of $\alpha$ the curve is
concave downwards, and for values only a little above 2 the
curve has a very fast initial increase, meaning that a large
fraction of the wealth is concentrated in the hands of a
small fraction of the population. Curves of this kind are
called Lorenz curves, after Max Lorenz, who first studied
them around the turn of the twentieth century [34].

FIG. 6 The fraction $W$ of the total wealth in a country held by
the fraction $P$ of the richest people, if wealth is distributed
following a power law with exponent $\alpha$. If $\alpha = 2.1$, for instance,
as it appears to in the United States (Table I), then the richest
20% of the population hold about 86% of the wealth (dashed
lines).

(Footnote to Section III.C: Equation (23) can also be derived by a
simpler, although less rigorous, heuristic argument: if $P(x) = 1/n$
for some value of $x$ then we expect there to be on average one
sample in the range from $x$ to $\infty$, and this of course will be the
largest sample. Thus a rough estimate of $\langle x_{\max} \rangle$ can be derived
by setting our expression for $P(x)$, Eq. (16), equal to $1/n$ and
rearranging for $x$, which immediately gives
$\langle x_{\max} \rangle \sim n^{1/(\alpha-1)}$.)

Using the exponents from Table I, we can for example
calculate that about 80% of the wealth should be in the 
hands of the richest 20% of the population (the so-called 
"80/20 rule", which is borne out by more detailed obser- 
vations of the wealth distribution), the top 20% of web 
sites get about two-thirds of all web hits, and the largest 
10% of US cities house about 60% of the country's total 
population. 
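
The relation $W = P^{(\alpha-2)/(\alpha-1)}$ is simple enough to evaluate directly. The sketch below, with a hypothetical function name, reproduces the two figures quoted for $\alpha = 2.1$: roughly 94% of the wealth in the richer half and roughly 86% in the richest fifth.

```python
def wealth_fraction(p_top, alpha):
    """Share W of total wealth held by the richest fraction p_top, W = P^((alpha-2)/(alpha-1))."""
    if not 0.0 < p_top <= 1.0:
        raise ValueError("p_top must lie in (0, 1]")
    if alpha <= 2.0:
        raise ValueError("the formula assumes alpha > 2 so that the integrals converge")
    return p_top ** ((alpha - 2.0) / (alpha - 1.0))


if __name__ == "__main__":
    for p in (0.5, 0.2, 0.01):
        print(f"richest {p:.0%} hold {wealth_fraction(p, 2.1):.1%} of the wealth")
```

Sweeping `p_top` from 0 to 1 for a fixed `alpha` traces out one Lorenz curve of the kind described above.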

If $\alpha \le 2$ then the situation becomes even more
extreme. In that case, the integrals in Eq. (27) diverge
at their upper limits, meaning that in fact they depend
on the value of the largest sample, as described in
Section III.B. But for $\alpha > 1$, Eq. (23) tells us that the
expected value of $x_{\max}$ goes to $\infty$ as $n$ becomes large,
and in that limit the fraction of money in the top half
of the population, Eq. (26), tends to unity. In fact, the
fraction of money in the top anything of the population,
even the top 1%, tends to unity, as Eq. (27) shows. In
other words, for distributions with $\alpha < 2$, essentially all
of the wealth (or other commodity) lies in the tail of the
distribution. The distribution of family names in the US,
which has an exponent $\alpha = 1.9$, is an example of this type
of behaviour. For the data of Fig. 4k, about 75% of the
population have names in the top 15 000. Estimates of
the total number of unique family names in the US put
the figure at around 1.5 million. So in this case 75% of
the population have names in the most common 1%,
a very top-heavy distribution indeed. The line $\alpha = 2$
thus separates the regime in which you will with some 
frequency meet people with uncommon names from the 
regime in which you will rarely meet such people. 



E. Scale-free distributions 

A power-law distribution is also sometimes called a 
scale-free distribution. Why? Because a power law is the 
only distribution that is the same whatever scale we look 
at it on. By this we mean the following. 

Suppose we have some probability distribution $p(x)$ for
a quantity $x$, and suppose we discover or somehow deduce
that it satisfies the property that

$$p(bx) = g(b)\,p(x), \tag{29}$$

for any $b$. That is, if we increase the scale or units by
which we measure $x$ by a factor of $b$, the shape of the
distribution $p(x)$ is unchanged, except for an overall
multiplicative constant. Thus for instance, we might find that
computer files of size 2kB are 1/4 as common as files of
size 1kB. Switching to measuring size in megabytes we
also find that files of size 2MB are 1/4 as common as files
of size 1MB. Thus the shape of the file-size distribution 
curve (at least for these particular values) does not de- 
pend on the scale on which we measure file size. 

This scale-free property is certainly not true of most 
distributions. It is not true for instance of the exponen- 
tial distribution. In fact, as we now show, it is only true 
of one type of distribution, the power law. 

Starting from Eq. (29), let us first set $x = 1$, giving
$p(b) = g(b)\,p(1)$. Thus $g(b) = p(b)/p(1)$ and Eq. (29) can be
written as

$$p(bx) = \frac{p(b)\,p(x)}{p(1)}. \tag{30}$$

Since this equation is supposed to be true for any $b$, we
can differentiate both sides with respect to $b$ to get

$$x\,p'(bx) = \frac{p'(b)\,p(x)}{p(1)}, \tag{31}$$

where $p'$ indicates the derivative of $p$ with respect to its
argument. Now we set $b = 1$ and get

$$x\,\frac{dp}{dx} = \frac{p'(1)}{p(1)}\,p(x). \tag{32}$$

This is a simple first-order differential equation which has
the solution

$$\ln p(x) = \frac{p'(1)}{p(1)} \ln x + \text{constant}. \tag{33}$$

Setting $x = 1$ we find that the constant is simply $\ln p(1)$,
and then taking exponentials of both sides

$$p(x) = p(1)\,x^{-\alpha}, \tag{34}$$

where $\alpha = -p'(1)/p(1)$. Thus, as advertised, the power-law
distribution is the only function satisfying the scale-free
criterion (29).
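
The scale-free criterion can be checked numerically. In the sketch below (all names are illustrative), the ratio $p(bx)/p(x)$ is computed at several values of $x$: for a power law it is the same everywhere, while for an exponential it is not. A second helper estimates the logarithmic slope $d\ln p/d\ln x$ by a central difference, which for a power law equals $-\alpha$ at every $x$.

```python
import math


def scale_ratio(p, b, x):
    """g = p(b*x) / p(x); for a scale-free p this depends on b but not on x."""
    return p(b * x) / p(x)


def log_slope(p, x, h=1e-4):
    """Central-difference estimate of d ln p / d ln x at the point x."""
    return (math.log(p(x * math.exp(h))) - math.log(p(x * math.exp(-h)))) / (2.0 * h)


def power_law(x, alpha=2.5):
    """Unnormalized power law x^(-alpha)."""
    return x ** (-alpha)


def exponential(x, rate=1.0):
    """Unnormalized exponential e^(-rate*x), used as a counter-example."""
    return math.exp(-rate * x)


if __name__ == "__main__":
    for x in (1.0, 10.0, 100.0):
        print(x, scale_ratio(power_law, 2.0, x), scale_ratio(exponential, 2.0, x))
    print(log_slope(power_law, 1.0), log_slope(power_law, 50.0))
```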

This fact is more than just a curiosity. As we will 
see in Section IV.E, there are some systems that become
scale-free for certain special values of their governing pa- 
rameters. The point defined by such a special value is 
called a "continuous phase transition" and the argument 
given above implies that at such a point the observable 
quantities in the system should adopt a power-law dis- 
tribution. This indeed is seen experimentally and the 
distributions so generated provided the original motiva- 
tion for the study of power laws in physics (although 
most experimentally observed power laws are probably 
not the result of phase transitions — a variety of other 
mechanisms produce power-law behaviour as well, as we 
will shortly see). 



F. Power laws for discrete variables 

So far I have focused on power-law distributions for 
continuous real variables, but many of the quantities we 
deal with in practical situations are in fact discrete — 
usually integers. For instance, populations of cities, num- 
bers of citations to papers or numbers of copies of books 
sold are all integer quantities. In most cases, the distinc- 
tion is not very important. The power law is obeyed only 
in the tail of the distribution where the values measured 
are so large that, to all intents and purposes, they can be 
considered continuous. Technically however, power-law 
distributions should be defined slightly differently for in- 
teger quantities. 

If $k$ is an integer variable, then one way to proceed is
to declare that it follows a power law if the probability $p_k$
of measuring the value $k$ obeys

$$p_k = C k^{-\alpha}, \tag{35}$$

for some constant exponent $\alpha$. Clearly this distribution
cannot hold all the way down to $k = 0$, since it diverges
there, but it could in theory hold down to $k = 1$. If we
discard any data for $k = 0$, the constant $C$ would then
be given by the normalization condition

$$1 = \sum_{k=1}^{\infty} p_k = C \sum_{k=1}^{\infty} k^{-\alpha} = C\,\zeta(\alpha), \tag{36}$$

where $\zeta(\alpha)$ is the Riemann $\zeta$-function. Rearranging, we
find that $C = 1/\zeta(\alpha)$ and

$$p_k = \frac{k^{-\alpha}}{\zeta(\alpha)}. \tag{37}$$

If, as is usually the case, the power-law behaviour is seen
only in the tail of the distribution, for values $k \ge k_{\min}$,
then the equivalent expression is

$$p_k = \frac{k^{-\alpha}}{\zeta(\alpha, k_{\min})}, \tag{38}$$

where $\zeta(\alpha, k_{\min}) = \sum_{k=k_{\min}}^{\infty} k^{-\alpha}$ is the generalized or
incomplete $\zeta$-function.

Most of the results of the previous sections can be gen- 
eralized to the case of discrete variables, although the 
mathematics is usually harder and often involves special 
functions in place of the more tractable integrals of the 
continuous case. 
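
To make the discrete normalization concrete, here is a small sketch that evaluates the generalized $\zeta$-function without any external library. It sums the first terms exactly and approximates the remainder with the leading Euler-Maclaurin terms (an integral plus half the first omitted term). That approximation, the default number of terms and the function names are choices made for this example; a library routine for the Hurwitz zeta function could be substituted.

```python
import math


def hurwitz_zeta(alpha, k_min=1, terms=100_000):
    """Approximate sum_{k >= k_min} k^(-alpha) for alpha > 1 and integer k_min >= 1."""
    if alpha <= 1.0:
        raise ValueError("the sum converges only for alpha > 1")
    if k_min < 1:
        raise ValueError("k_min must be a positive integer")
    head = math.fsum(k ** (-alpha) for k in range(k_min, k_min + terms))
    m = k_min + terms  # first index not included in the explicit sum
    tail = m ** (1.0 - alpha) / (alpha - 1.0) + 0.5 * m ** (-alpha)
    return head + tail


def make_discrete_pmf(alpha, k_min=1):
    """Return p_k = k^(-alpha) / zeta(alpha, k_min), as in Eqs. (37) and (38)."""
    norm = hurwitz_zeta(alpha, k_min)

    def pmf(k):
        return k ** (-alpha) / norm if k >= k_min else 0.0

    return pmf


if __name__ == "__main__":
    print(hurwitz_zeta(2.0), math.pi ** 2 / 6)  # the two values should agree closely
    pmf = make_discrete_pmf(2.5, k_min=1)
    print([pmf(k) for k in range(1, 6)])
```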

It has occasionally been proposed that Eq. (35) is not
the best generalization of the power law to the discrete
case. An alternative and often more convenient form is

$$p_k = C\,\frac{\Gamma(k)\,\Gamma(\alpha)}{\Gamma(k+\alpha)} = C\,\mathrm{B}(k, \alpha), \tag{39}$$

where $\mathrm{B}(a, b)$ is, as before, the Legendre beta-function,
Eq. (20). As mentioned in Section III.C, the beta-function
behaves as a power law $\mathrm{B}(k, \alpha) \sim k^{-\alpha}$ for large $k$
and so the distribution has the desired asymptotic form.
Simon [35] proposed that Eq. (39) be called the Yule
distribution, after Udny Yule who derived it as the limiting
distribution in a certain stochastic process [36], and this
name is often used today. Yule's result is described in
Section IV.D.

The Yule distribution is nice because sums involving it
can frequently be performed in closed form, where sums
involving Eq. (35) can only be written in terms of special
functions. For instance, the normalizing constant $C$ for
the Yule distribution is given by

$$1 = C \sum_{k=1}^{\infty} \mathrm{B}(k, \alpha) = \frac{C}{\alpha-1}, \tag{40}$$

and hence $C = \alpha - 1$ and

$$p_k = (\alpha - 1)\,\mathrm{B}(k, \alpha). \tag{41}$$

The first and second moments (i.e., the mean and mean
square of the distribution) are

$$\langle k \rangle = \frac{\alpha-1}{\alpha-2}, \qquad
  \langle k^2 \rangle = \frac{(\alpha-1)^2}{(\alpha-2)(\alpha-3)}, \tag{42}$$

and there are similarly simple expressions corresponding 
to many of our earlier results for the continuous case. 
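
The closed-form results for the Yule distribution, $p_k = (\alpha-1)\,\mathrm{B}(k,\alpha)$ with mean $(\alpha-1)/(\alpha-2)$, can be checked by direct summation. The sketch below is illustrative only: the function name and the cutoff at 50 000 terms are assumptions, and the beta-function is computed from `math.lgamma` to avoid overflow in the $\Gamma$-functions.

```python
import math


def yule_pmf(k, alpha):
    """p_k = (alpha - 1) * B(k, alpha), with B evaluated through log-gamma."""
    log_beta = math.lgamma(k) + math.lgamma(alpha) - math.lgamma(k + alpha)
    return (alpha - 1.0) * math.exp(log_beta)


if __name__ == "__main__":
    alpha = 3.5
    ks = range(1, 50_001)
    total = math.fsum(yule_pmf(k, alpha) for k in ks)
    mean = math.fsum(k * yule_pmf(k, alpha) for k in ks)
    print(total)                                 # should be close to 1
    print(mean, (alpha - 1.0) / (alpha - 2.0))   # partial sum vs closed form
```

Because the terms fall off as $k^{-\alpha}$, the truncated sums converge quickly for exponents well above 2 and slowly for exponents near 2.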



IV. MECHANISMS FOR GENERATING POWER-LAW 
DISTRIBUTIONS 

In this section we look at possible candidate mech- 
anisms by which power-law distributions might arise in 
natural and man-made systems. Some of the possibilities 
that have been suggested are quite complex — notably the 
physics of critical phenomena and the tools of the renormalization
group that are used to analyse it. But let us
start with some simple algebraic methods of generating 
power-law functions and progress to the more involved 
mechanisms later. 



A. Combinations of exponentials 

A much more common distribution than the power law 
is the exponential, which arises in many circumstances, 
such as survival times for decaying atomic nuclei or the 
Boltzmann distribution of energies in statistical mechanics.
Suppose some quantity $y$ has an exponential distribution:

$$p(y) \sim e^{ay}. \tag{43}$$

The constant $a$ might be either negative or positive. If
it is positive then there must also be a cutoff on the
distribution (a limit on the maximum value of $y$) so that
the distribution is normalizable.

Now suppose that the real quantity we are interested in
is not $y$ but some other quantity $x$, which is exponentially
related to $y$ thus:

$$x \sim e^{by}, \tag{44}$$

with $b$ another constant, also either positive or negative.
Then the probability distribution of $x$ is

$$p(x) = p(y)\,\frac{dy}{dx} \sim \frac{e^{ay}}{b\,e^{by}} = \frac{x^{-1+a/b}}{b}, \tag{45}$$

which is a power law with exponent $\alpha = 1 - a/b$.
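
A quick simulation can illustrate the mechanism for the case $a < 0$, $b > 0$. The sketch draws $y$ from an exponential distribution, sets $x = e^{by}$ (so that $x_{\min} = 1$), and estimates the exponent with the standard maximum-likelihood estimator $\alpha = 1 + n\,[\sum_i \ln(x_i/x_{\min})]^{-1}$. The function name, the seed and the sample size are assumptions for this example; since $\ln x = by$, the code never forms $x$ itself and so avoids overflow.

```python
import random


def exponent_from_exponentials(a, b, n, seed=0):
    """Estimate alpha for x = exp(b*y) with p(y) ~ exp(a*y), a < 0 < b, y >= 0."""
    if not (a < 0.0 < b):
        raise ValueError("this sketch assumes a < 0 and b > 0")
    rng = random.Random(seed)
    log_x_sum = 0.0
    for _ in range(n):
        y = rng.expovariate(-a)   # exponential with rate -a, density ~ e^{a y}
        log_x_sum += b * y        # ln(x / x_min) with x_min = 1
    return 1.0 + n / log_x_sum    # maximum-likelihood estimate of alpha


if __name__ == "__main__":
    a, b = -1.5, 1.0
    print(exponent_from_exponentials(a, b, 100_000), 1.0 - a / b)
```

The estimate should scatter around $1 - a/b$ with a standard error of order $(\alpha-1)/\sqrt{n}$.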

A version of this mechanism was used by Miller [37] to
explain the power-law distribution of the frequencies of
words as follows (see also [38]). Suppose we type randomly
on a typewriter, pressing the space bar with
probability $q_s$ per stroke and each letter with equal
probability $q_l$ per stroke. If there are $m$ letters in the
alphabet then $q_l = (1 - q_s)/m$. (In this simplest version of the
argument we also type no punctuation, digits or other
non-letter symbols.) Then the frequency $x$ with which
a particular word with $y$ letters (followed by a space)
occurs is

$$x = \left[\frac{1 - q_s}{m}\right]^{y} q_s \sim e^{by}, \tag{46}$$

where $b = \ln(1 - q_s) - \ln m$. The number (or fraction) of
distinct possible words with length between $y$ and $y + dy$
goes up exponentially as $p(y) \sim m^y = e^{ay}$ with $a = \ln m$.

(Footnote: This argument is sometimes called the "monkeys with
typewriters" argument, the monkey being the traditional exemplar
of a random typist.)

Thus, following our argument above, the distribution of
frequencies of words has the form $p(x) \sim x^{-\alpha}$ with

$$\alpha = 1 - \frac{a}{b} = \frac{2\ln m - \ln(1 - q_s)}{\ln m - \ln(1 - q_s)}. \tag{47}$$

For the typical case where $m$ is reasonably large and $q_s$
quite small this gives $\alpha \simeq 2$, in approximate agreement
with Table I.
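
For a sense of the numbers, the exponent predicted by the random-typing argument can be evaluated for a given alphabet size and space probability. The values $m = 26$ and $q_s = 0.2$ in the usage lines are illustrative assumptions, not figures from the text.

```python
import math


def miller_exponent(m, q_s):
    """alpha = 1 - a/b with a = ln m and b = ln(1 - q_s) - ln m."""
    if m < 2 or not 0.0 < q_s < 1.0:
        raise ValueError("need at least two letters and 0 < q_s < 1")
    a = math.log(m)
    b = math.log(1.0 - q_s) - math.log(m)
    return 1.0 - a / b


if __name__ == "__main__":
    # Hypothetical example: 26 letters, space pressed on one stroke in five.
    print(miller_exponent(26, 0.2))
    # The exponent approaches 2 as the space probability becomes small.
    print(miller_exponent(26, 0.01))
```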

This is a reasonable theory as far as it goes, but real 
text is not made up of random letters. Most combina- 
tions of letters don't occur in natural languages; most are 
not even pronounceable. We might imagine that some 
constant fraction of possible letter sequences of a given 
length would correspond to real words and the argument 
above would then work just fine when applied to that 
fraction, but upon reflection this suggestion is obviously 
bogus. It is clear for instance that very long words sim- 
ply don't exist in most languages, although there are ex- 
ponentially many possible combinations of letters avail- 
able to make them up. This observation is backed up 
by empirical data. In Fig. 7a we show a histogram of
the lengths of words occurring in the text of Moby Dick, 
and one would need a particularly vivid imagination to 
convince oneself that this histogram follows anything like 
the exponential assumed by Miller's argument. (In fact, 
the curve appears roughly to follow a log-normal [33].)

There may still be some merit in Miller's argument 
however. The problem may be that we are measuring 
word "length" in the wrong units. Letters are not really 
the basic units of language. Some basic units are letters, 
but some are groups of letters. The letters "th" for ex- 
ample often occur together in English and make a single 
sound, so perhaps they should be considered to be a sep- 
arate symbol in their own right and contribute only one 
unit to the word length? 

Following this idea to its logical conclusion we 
can imagine replacing each fundamental unit of the 
language — whatever that is — by its own symbol and then 
measuring lengths in terms of numbers of symbols. The 
pursuit of ideas along these lines led Claude Shannon 
in the 1940s to develop the field of information the- 
ory, which gives a precise prescription for calculating the 
number of symbols necessary to transmit words or any 
other data [39, 40]. The units of information are bits and
the true "length" of a word can be considered to be the 
number of bits of information it carries. Shannon showed 
that if we regard words as the basic divisions of a message,
the information $y$ carried by any particular word
is

$$y = -k \ln x, \tag{48}$$

where $x$ is the frequency of the word as before and $k$ is
a constant. (The reader interested in finding out more
about where this simple relation comes from is recommended
to look at the excellent introduction to information
theory by Cover and Thomas [41].)

But this has precisely the form that we want. Inverting
it we have $x = e^{-y/k}$, and if the probability distribution of
the "lengths" measured in terms of bits is also exponential
as in Eq. (43) we will get our power-law distribution.
Figure 7b shows the latter distribution, and indeed it
follows a nice exponential, much better than Fig. 7a.

This is still not an entirely satisfactory explanation.
Having made the shift from pure word length to information
content, our simple count of the number of words of
length $y$ (that it goes exponentially as $m^y$) is no longer
valid, and now we need some reason why there should be
exponentially more distinct words in the language of high
information content than of low. That this is the case is
experimentally verified by Fig. 7b, but the reason must
be considered still a matter of debate. Some possibilities
are discussed by, for instance, Mandelbrot [42] and more
recently by Mitzenmacher [19].

Another example of the "combination of exponentials"
mechanism has been discussed by Reed and Hughes [43].
They consider a process in which a set of items, piles or
groups each grows exponentially in time, having size
$x \sim e^{bt}$ with $b > 0$. For instance, populations of organisms
reproducing freely without resource constraints grow
exponentially. Items also have some fixed probability of
dying per unit time (populations might have a stochastically
constant probability of extinction), so that the
times $t$ at which they die are exponentially distributed
$p(t) \sim e^{at}$ with $a < 0$.

These functions again follow the form of Eqs. (43)
and (44) and result in a power-law distribution of the
sizes $x$ of the items or groups at the time they die. Reed
and Hughes suggest that variations on this argument may
explain the sizes of biological taxa, incomes and cities,
among other things.

FIG. 7 (a) Histogram of the lengths in letters of all distinct
words in the text of the novel Moby Dick. (b) Histogram of
the information content à la Shannon, in bits, of words in Moby
Dick. The former does not, by any stretch of the imagination,
follow an exponential, but the latter could easily be said to do so.
(Note that the vertical axes are logarithmic.)

B. Inverses of quantities

Suppose some quantity $y$ has a distribution $p(y)$ that
passes through zero, thus having both positive and negative
values. And suppose further that the quantity we
are really interested in is the reciprocal $x = 1/y$, which
will have distribution

$$p(x) = p(y)\left|\frac{dy}{dx}\right| = \frac{p(y)}{x^2}. \tag{49}$$

The large values of $x$, those in the tail of the distribution,
correspond to the small values of $y$ close to zero and thus
the large-$x$ tail is given by

$$p(x) \sim x^{-2}, \tag{50}$$

where the constant of proportionality is $p(y = 0)$.
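
The reciprocal mechanism can be checked exactly for one concrete case. Taking $y$ to be normally distributed with zero mean (an assumption made only for this example), the probability that $x = 1/y$ exceeds a positive value $X$ is $P(0 < y < 1/X)$, which is available in closed form through the error function. Integrating the tail density $p(y=0)\,x^{-2}$ gives the power-law prediction $p(y=0)/X$; the two should agree for large $X$ and disagree for small $X$.

```python
import math


def tail_prob_reciprocal_normal(x, sigma=1.0):
    """Exact P(1/y > x) for y ~ Normal(0, sigma^2) and x > 0, i.e. P(0 < y < 1/x)."""
    return 0.5 * math.erf(1.0 / (x * sigma * math.sqrt(2.0)))


def tail_prob_power_law(x, sigma=1.0):
    """Power-law prediction p(y=0) / x from integrating the x^(-2) tail."""
    return 1.0 / (x * sigma * math.sqrt(2.0 * math.pi))


if __name__ == "__main__":
    for x in (1.0, 10.0, 100.0, 1000.0):
        exact = tail_prob_reciprocal_normal(x)
        approx = tail_prob_power_law(x)
        print(x, exact, approx, exact / approx)  # ratio tends to 1 as x grows
```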

More generally, any quantity $x = y^{-\gamma}$ for some $\gamma$ will
have a power-law tail to its distribution $p(x) \sim x^{-\alpha}$, with
$\alpha = 1 + 1/\gamma$. It is not clear who the first author or authors
were to describe this mechanism, but clear descriptions
have been given recently by Bouchaud [44], Jan et al. [45]
and Sornette [46].

One might argue that this mechanism merely generates 
a power law by assuming another one: the power-law
relationship between $x$ and $y$ generates a power-law
distribution for $x$. This is true, but the point is that the
mechanism takes some physical power-law relationship between
$x$ and $y$ (not a stochastic probability distribution) and
from that generates a power-law probability distribution.
This is a non-trivial result. 

One circumstance in which this mechanism arises is 
in measurements of the fractional change in a quantity. 
For instance, Jan et al. [45] consider one of the most
famous systems in theoretical physics, the Ising model of
a magnet. In its paramagnetic phase, the Ising model has
a magnetization that fluctuates around zero. Suppose we
measure the magnetization $m$ at uniform intervals and
calculate the fractional change $\delta = (\Delta m)/m$ between
each successive pair of measurements. The change $\Delta m$
is roughly normally distributed and has a typical size set
by the width of that normal distribution. The $1/m$ on the
other hand produces a power-law tail when small values
of $m$ coincide with large values of $\Delta m$, so that the tail of
the distribution of $\delta$ follows $p(\delta) \sim \delta^{-2}$ as above.

In Fig. 8 I show a cumulative histogram of measurements
of $\delta$ for simulations of the Ising model on a
square lattice, and the power-law distribution is clearly
visible. Using Eq. (5), the value of the exponent is
$\alpha = 1.98 \pm 0.04$, in good agreement with the expected
value of 2. 



(Footnote, on the origin of this mechanism: A correspondent tells
me that a similar mechanism was described in an astrophysical
context by Chandrasekhar in a paper in 1943, but I have been
unable to confirm this.)

FIG. 8 Cumulative histogram of the magnetization fluctuations
of a 128 × 128 nearest-neighbour Ising model on a
square lattice. The model was simulated at a temperature
of 2.5 times the spin-spin coupling for 100 000 time
steps using the cluster algorithm of Swendsen and Wang
and the magnetization per spin measured at intervals of
ten steps. The fluctuations were calculated as the ratio
$\delta_i = 2(m_{i+1} - m_i)/(m_{i+1} + m_i)$.

FIG. 9 The position of a one-dimensional random walker
(vertical axis) as a function of time (horizontal axis). The
probability $u_{2n}$ that the walk returns to zero at time $t = 2n$ is equal
to the probability $f_{2m}$ that it returns to zero for the first time
at some earlier time $t = 2m$, multiplied by the probability
$u_{2n-2m}$ that it returns again a time $2n - 2m$ later, summed
over all possible values of $m$. We can use this observation
to write a consistency relation, Eq. (51), that can be solved
for $f_t$, Eq. (59).



C. Random walks

Many properties of random walks are distributed according
to power laws, and this could explain some
power-law distributions observed in nature. In particular,
a randomly fluctuating process that undergoes "gambler's
ruin", i.e., that ends when it hits zero, has a
power-law distribution of possible lifetimes.

(Footnote 13: Gambler's ruin is so called because a gambler's
night of betting ends when his or her supply of money hits zero,
assuming the gambling establishment declines to offer him or her
a line of credit.)

Consider a random walk in one dimension, in which a
walker takes a single step randomly one way or the other
along a line in each unit of time. Suppose the walker
starts at position 0 on the line and let us ask what the
probability is that the walker returns to position 0 for the
first time at time $t$ (i.e., after exactly $t$ steps). This is the
so-called first return time of the walk and represents the
lifetime of a gambler's ruin process. A trick for answering
this question is depicted in Fig. 9. We consider first the
unconstrained problem in which the walk is allowed to
return to zero as many times as it likes, before returning
there again at time $t$. Let us denote the probability of
this event as $u_t$. Let us also denote by $f_t$ the probability
that the first return time is $t$. We note that both of these
probabilities are non-zero only for even values of their
arguments since there is no way to get back to zero in
any odd number of steps.

As Fig. 9 illustrates, the probability $u_t = u_{2n}$, with $n$
integer, can be written

$$u_{2n} = \begin{cases}
  1 & \text{if } n = 0, \\
  \sum_{m=1}^{n} f_{2m}\,u_{2n-2m} & \text{if } n \ge 1,
\end{cases} \tag{51}$$

where $m$ is also an integer and we define $f_0 = 0$ and
$u_0 = 1$. This equation can conveniently be solved for $f_{2n}$
using a generating function approach. We define

$$U(z) = \sum_{n=0}^{\infty} u_{2n} z^n, \qquad
  F(z) = \sum_{n=1}^{\infty} f_{2n} z^n. \tag{52}$$

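Then, multiplying Eq. (51) throughout by $z^n$ and summing,
we find

$$\begin{aligned}
U(z) &= 1 + \sum_{n=1}^{\infty} \sum_{m=1}^{n} f_{2m}\,u_{2n-2m}\,z^n \\
     &= 1 + \sum_{m=1}^{\infty} f_{2m} z^m \sum_{n=m}^{\infty} u_{2n-2m}\,z^{n-m} \\
     &= 1 + F(z)\,U(z).
\end{aligned} \tag{53}$$

So

$$F(z) = 1 - \frac{1}{U(z)}. \tag{54}$$

The function $U(z)$ however is quite easy to calculate.
The probability $u_{2n}$ that we are at position zero after $2n$
steps is

$$u_{2n} = 2^{-2n} \binom{2n}{n}, \tag{55}$$

so that

$$U(z) = \sum_{n=0}^{\infty} \binom{2n}{n} \frac{z^n}{4^n} = \frac{1}{\sqrt{1-z}}. \tag{56}$$

And hence

$$F(z) = 1 - \sqrt{1-z}. \tag{57}$$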



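Expanding this function using the binomial theorem thus:

$$F(z) = \tfrac{1}{2} z + \frac{\tfrac{1}{2} \times \tfrac{1}{2}}{2!}\, z^2 + \frac{\tfrac{1}{2} \times \tfrac{1}{2} \times \tfrac{3}{2}}{3!}\, z^3 + \ldots = \sum_{n=1}^{\infty} \frac{\binom{2n}{n}}{(2n-1)\,2^{2n}}\, z^n, \tag{58}$$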



and comparing this expression with Eq. (52), we immediately see that

$$f_{2n} = \frac{\binom{2n}{n}}{(2n-1)\,2^{2n}}, \tag{59}$$



and we have our solution for the distribution of first re- 
turn times. 
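As a quick sanity check on the generating-function result, the following sketch solves the recursion of Eq. (51) directly by forward substitution and compares the answer with the closed form of Eq. (59), using exact rational arithmetic. The function names are illustrative choices, not anything defined in the text.

```python
from fractions import Fraction
from math import comb


def u(n):
    """Probability u_{2n} of being at the origin after 2n steps, Eq. (55)."""
    return Fraction(comb(2 * n, n), 4 ** n)


def first_return_by_recursion(n_max):
    """Solve Eq. (51) for f_{2n}, n = 1..n_max, by forward substitution."""
    f = [Fraction(0)]  # f_0 = 0 by definition
    for n in range(1, n_max + 1):
        # u_{2n} = sum_{m=1}^{n} f_{2m} u_{2n-2m} with u_0 = 1, solved for f_{2n}
        partial = sum(f[m] * u(n - m) for m in range(1, n))
        f.append(u(n) - partial)
    return f


def first_return_closed_form(n):
    """Closed form of Eq. (59)."""
    return Fraction(comb(2 * n, n), (2 * n - 1) * 4 ** n)


f = first_return_by_recursion(20)
assert all(f[n] == first_return_closed_form(n) for n in range(1, 21))
print([str(x) for x in f[1:5]])  # ['1/2', '1/8', '1/16', '5/128']
```

The first few values agree with the coefficients of $z$, $z^2$ and $z^3$ in the expansion of Eq. (58).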

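Now consider the form of $f_{2n}$ for large $n$. Writing out the binomial coefficient as $\binom{2n}{n} = (2n)!/(n!)^2$, we take logs thus:

$$\ln f_{2n} = \ln(2n)! - 2\ln n! - 2n\ln 2 - \ln(2n-1), \tag{60}$$

and use Stirling's formula $\ln n! \simeq n\ln n - n + \tfrac{1}{2}\ln n$ to get $\ln f_{2n} \simeq \tfrac{1}{2}\ln 2 - \tfrac{1}{2}\ln n - \ln(2n-1)$, or

$$f_{2n} \simeq \sqrt{\frac{2}{n(2n-1)^2}}. \tag{61}$$

(The form of Stirling's formula used here omits the constant term $\tfrac{1}{2}\ln 2\pi$, so Eq. (61) is correct in its dependence on $n$ but not in its numerical prefactor.) In the limit $n \to \infty$, this implies that $f_{2n} \sim n^{-3/2}$ or equivalently

$$f_t \sim t^{-3/2}. \tag{62}$$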



So the distribution of return times follows a power law with exponent $\alpha = \tfrac{3}{2}$. Note that the distribution has a divergent mean (because $\alpha \le 2$). As discussed in Section III, this implies that the mean is finite for any finite sample but can take very different values for different samples, so that the value measured for any one sample gives little or no information about the value for any other.
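The approach to the $-3/2$ power law can be checked numerically. The sketch below evaluates Eq. (59) through the log-gamma function (to avoid very large integers) and reports the local log-log slope between $n$ and $2n$. It also prints the ratio of $f_{2n}$ to $1/(2\sqrt{\pi}\,n^{3/2})$, the large-$n$ form one obtains if the full Stirling formula $\ln n! \simeq n\ln n - n + \tfrac{1}{2}\ln 2\pi n$ is used; that prefactor is my own working, not a result quoted in the text.

```python
from math import exp, lgamma, log, pi, sqrt


def log_f(n):
    """ln f_{2n} from Eq. (59), computed with log-gamma functions."""
    return lgamma(2 * n + 1) - 2 * lgamma(n + 1) - 2 * n * log(2) - log(2 * n - 1)


def local_exponent(n):
    """Minus the log-log slope of f_{2n} measured between n and 2n."""
    return -(log_f(2 * n) - log_f(n)) / log(2)


for n in (10, 100, 1000, 10000):
    # Ratio to the assumed asymptotic form 1 / (2 sqrt(pi) n^{3/2}).
    ratio = exp(log_f(n)) * n ** 1.5 * 2 * sqrt(pi)
    print(n, round(local_exponent(n), 4), round(ratio, 4))
```

The local exponent should settle towards 1.5 and the ratio towards 1 as $n$ grows.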

As an example application, the random walk can be considered a simple model for the lifetime of biological taxa. A taxon is a branch of the evolutionary tree, a group of species all descended by repeated speciation from a common ancestor. The ranks of the Linnean hierarchy (genus, family, order and so forth) are examples of taxa. If a taxon gains and loses species at random over time, then the number of species performs a random walk, the taxon becoming extinct when the number of species reaches zero for the first (and only) time. (This is one example of "gambler's ruin".) Thus the time for which taxa live should have the same distribution as the first return times of random walks.

^14 The enthusiastic reader can easily derive this result for him or herself by expanding $(1-z)^{-1/2}$ using the binomial theorem.

In fact, it has been argued that the distribution of the lifetimes of genera in the fossil record does indeed follow a power law [43]. The best fits to the available fossil data put the value of the exponent at $\alpha = 1.7 \pm 0.3$, which is in agreement with the simple random walk model [49].^15,16



D. The Yule process 

One of the most convincing and widely applicable 
mechanisms for generating power laws is the Yule pro- 
cess, whose invention was, coincidentally, also inspired 
by observations of the statistics of biological taxa as dis- 
cussed in the previous section. 

In addition to having a (possibly) power-law distribu- 
tion of lifetimes, biological taxa also have a very convinc- 
ing power-law distribution of sizes. That is, the distribu- 
tion of the number of species in a genus, family or other 
taxonomic group appears to follow a power law quite 
closely. This phenomenon was first reported by Willis
and Yule in 1922 for the example of flowering plants [15].
Three years later, Yule [36] offered an explanation using
a simple model that has since found wide application in 
other areas. He argued as follows. 

Suppose first that new species appear but they never 
die; species are only ever added to genera and never re- 
moved. This differs from the random walk model of the 
last section, and certainly from reality as well. It is be- 
lieved that in practice all species and all genera become 
extinct in the end. But let us persevere; there is nonethe- 
less much of worth in Yule's simple model. 

Species are added to genera by speciation, the splitting of one species into two, which is known to happen by a variety of mechanisms, including competition for resources, spatial separation of breeding populations and genetic drift. If we assume that this happens at some stochastically constant rate, then it follows that a genus with $k$ species in it will gain new species at a rate proportional to $k$, since each of the $k$ species has the same chance per unit time of dividing in two. Let us further suppose that occasionally, say once every $m$ speciation events, the new species produced is, by chance, sufficiently different from the others in its genus as to be considered the founder member of an entire new genus. (To be clear, we define $m$ such that $m$ species are added to pre-existing genera and then one species forms a new genus. So $m + 1$ new species appear for each new genus and there are $m + 1$ species per genus on average.) Thus the number of genera goes up steadily in this model, as does the number of species within each genus.

^15 Modern phylogenetic analysis, the quantitative comparison of species' genetic material, can provide a picture of the evolutionary tree and hence allow the accurate "cladistic" assignment of species to taxa. For prehistoric species, however, whose genetic material is not usually available, determination of evolutionary ancestry is difficult, so classification into taxa is based instead on morphology, i.e., on the shapes of organisms. It is widely acknowledged that such classifications are subjective and that the taxonomic assignments of fossil species are probably riddled with errors.

^16 To be fair, I consider the power law for the distribution of genus lifetimes to fall in the category of "tenuous" identifications to which I alluded in footnote 13. This theory should be taken with a pinch of salt.

We can analyse this Yule process mathematically as 
follows. Let us measure the passage of time in the 
model by the number of genera n. At each time-step 
one new species founds a new genus, thereby increasing 
n by 1, and m other species are added to various pre- 
existing genera which are selected in proportion to the 
number of species they already have. We denote by $p_{k,n}$ the fraction of genera that have $k$ species when the total number of genera is $n$. Thus the number of such genera is $n p_{k,n}$. We now ask what the probability is that the next species added to the system happens to be added to a particular genus $i$ having $k_i$ species in it already. This probability is proportional to $k_i$, and so when properly normalized is just $k_i / \sum_i k_i$. But $\sum_i k_i$ is simply the total number of species, which is $n(m+1)$. Furthermore, between the appearance of the $n$th and the $(n+1)$th genera, $m$ other new species are added, so the probability that genus $i$ gains a new species during this interval is $m k_i / (n(m+1))$. And the total expected number of genera of size $k$ that gain a new species in the same interval is

$$\frac{mk}{n(m+1)} \times n p_{k,n} = \frac{m}{m+1}\, k\, p_{k,n}. \tag{63}$$



Now we observe that the number of genera with $k$ species will decrease on each time step by exactly this number, since by gaining a new species they become genera with $k + 1$ instead. At the same time the number increases because of genera that previously had $k - 1$ species and now have an extra one. Thus we can write a master equation^17 for the new number $(n+1)\,p_{k,n+1}$ of genera with $k$ species thus:

$$(n+1)\,p_{k,n+1} = n\,p_{k,n} + \frac{m}{m+1}\bigl[(k-1)\,p_{k-1,n} - k\,p_{k,n}\bigr]. \tag{64}$$

^17 Yule's analysis of the process was considerably more involved than the one presented here, essentially because the theory of stochastic processes as we now know it did not yet exist in his time. The master equation method we employ is a relatively modern innovation, introduced in this context by Simon [35].

The only exception to this equation is for genera of size 1, 
which instead obey the equation 



$$(n+1)\,p_{1,n+1} = n\,p_{1,n} + 1 - \frac{m}{m+1}\,p_{1,n}, \tag{65}$$



since by definition exactly one new such genus appears 
on each time step. 
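The two master equations can be iterated numerically to watch the size distribution settle down. The sketch below is a minimal illustration under assumptions of my own: it starts from a single genus with one species at $n = 1$, tracks sizes only up to a cutoff `k_max`, and uses the function name `iterate_yule_master_equation`, none of which comes from the text. It compares the result with the value $p_1 = (m+1)/(2m+1)$ obtained in the stationary analysis.

```python
def iterate_yule_master_equation(m, n_steps, k_max):
    """Iterate Eqs. (64) and (65) for the expected fractions p_{k,n}.

    Assumed start: one genus holding one species (n = 1). Genera larger
    than k_max are not tracked.
    """
    rate = m / (m + 1)
    p = [0.0] * (k_max + 1)  # p[k] holds p_{k,n}; index 0 is unused
    p[1] = 1.0
    n = 1
    for _ in range(n_steps):
        new = [0.0] * (k_max + 1)
        # Eq. (65): one new genus of size 1 appears on every step.
        new[1] = (n * p[1] + 1 - rate * p[1]) / (n + 1)
        # Eq. (64): gain from genera of size k-1, loss to size k+1.
        for k in range(2, k_max + 1):
            new[k] = (n * p[k] + rate * ((k - 1) * p[k - 1] - k * p[k])) / (n + 1)
        p = new
        n += 1
    return p


m = 2
p = iterate_yule_master_equation(m, n_steps=20_000, k_max=30)
print(p[1], (m + 1) / (2 * m + 1))
print(p[2] / p[1], 1 / (3 + 1 / m))
```

The second printed pair compares the ratio $p_2/p_1$ with $(k-1)/(k+1+1/m)$ evaluated at $k = 2$.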

Now we ask what form the distribution of the sizes of genera takes in the limit of long times. To do this we allow $n \to \infty$ and assume that the distribution tends to some fixed value $p_k = \lim_{n\to\infty} p_{k,n}$ independent of $n$. Then Eq. (65) becomes $p_1 = 1 - m p_1/(m+1)$, which has the solution

$$p_1 = \frac{m+1}{2m+1}. \tag{66}$$

And Eq. (64) becomes

$$p_k = \frac{m}{m+1}\bigl[(k-1)\,p_{k-1} - k\,p_k\bigr], \tag{67}$$

which can be rearranged to read

$$p_k = \frac{k-1}{k+1+1/m}\,p_{k-1}, \tag{68}$$

and then iterated to get

$$p_k = \frac{(k-1)(k-2)\cdots 1}{(k+1+1/m)(k+1/m)\cdots(3+1/m)}\,p_1 = (1+1/m)\,\frac{(k-1)\cdots 1}{(k+1+1/m)\cdots(2+1/m)}, \tag{69}$$



where I have made use of Eq. (66). This can be simplified further by making use of a handy property of the $\Gamma$-function, that $\Gamma(a) = (a-1)\Gamma(a-1)$. Using this, and noting that $\Gamma(1) = 1$, we get

$$p_k = (1+1/m)\,\frac{\Gamma(k)\,\Gamma(2+1/m)}{\Gamma(k+2+1/m)} = (1+1/m)\,\mathrm{B}(k, 2+1/m), \tag{70}$$

where $\mathrm{B}(a,b)$ is again the beta-function, Eq. (20). This, we note, is precisely the distribution defined in Eq. (39), which Simon called the Yule distribution. Since the beta-function has a power-law tail $\mathrm{B}(a,b) \sim a^{-b}$, we can immediately see that $p_k$ also has a power-law tail with an exponent

$$\alpha = 2 + \frac{1}{m}. \tag{71}$$



The mean number $m + 1$ of species per genus for the example of flowering plants is about 3, making $m \simeq 2$ and $\alpha \simeq 2.5$. The actual exponent for the distribution found by Willis and Yule is $\alpha = 2.5 \pm 0.1$, which is in excellent agreement with the theory.
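To see the Yule distribution emerge from the process itself, one can simulate the growth rule directly and compare the resulting fractions with Eq. (70). The following is a sketch under stated assumptions: the simulation starts from one genus with one species, and the helper names `yule_pk` and `simulate_yule` are illustrative. Keeping one list entry per species means a uniform draw from the list selects a genus with probability proportional to its size.

```python
import random
from collections import Counter
from math import exp, lgamma


def yule_pk(k, m):
    """Stationary fraction of genera with k species, Eq. (70)."""
    a = 2 + 1 / m
    log_beta = lgamma(k) + lgamma(a) - lgamma(k + a)
    return (1 + 1 / m) * exp(log_beta)


def simulate_yule(m, n_genera, seed=0):
    """Grow n_genera genera; m species join existing genera per new genus."""
    rng = random.Random(seed)
    owners = [0]  # owners[j] is the genus of species j; start with genus 0
    for genus in range(1, n_genera):
        for _ in range(m):
            # Picking a random species picks its genus in proportion to size.
            owners.append(rng.choice(owners))
        owners.append(genus)  # one species founds a new genus
    sizes = Counter(owners)              # genus -> number of species
    histogram = Counter(sizes.values())  # size -> number of genera
    return {k: count / n_genera for k, count in histogram.items()}


m = 2
empirical = simulate_yule(m, n_genera=100_000)
for k in (1, 2, 3, 5, 10):
    print(k, round(empirical.get(k, 0.0), 4), round(yule_pk(k, m), 4))
```

Agreement is expected only up to statistical fluctuations, which grow in relative terms for large $k$ where few genera are found.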

Most likely this agreement is fortuitous, however. The 
Yule process is probably not a terribly realistic expla- 
nation for the distribution of the sizes of genera, princi- 
pally because it ignores the fact that species (and gen- 
era) become extinct. However, it has been adapted and 
generalized by others to explain power laws in many 
other systems, most famously city sizes [s^, paper ci- 
tations |5fl , and links to pages on the world wide 
web [s^. l53j. The most general form of the Yule process 
is as follows. 

Suppose we have a system composed of a collection of objects, such as genera, cities, papers, web pages and so forth. New objects appear every once in a while as cities grow up or people publish new papers. Each object also has some property $k$ associated with it, such as number of species in a genus, people in a city or citations to a paper, that is reputed to obey a power law, and it is this power law that we wish to explain. Newly appearing objects have some initial value of $k$ which we will denote $k_0$. New genera initially have only a single species, $k_0 = 1$, but new towns or cities might have quite a large initial population: a single person living in a house somewhere is unlikely to constitute a town in their own right but $k_0 = 100$ people might do so. The value of $k_0$ can also be zero in some cases: newly published papers usually have zero citations for instance.

In between the appearance of one object and the next, $m$ new species/people/citations etc. are added to the entire system. That is, some cities or papers will get new people or citations, but not necessarily all will. And in the simplest case these are added to objects in proportion to the number that the object already has. Thus the probability of a city gaining a new member is proportional to the number already there; the probability of a paper getting a new citation is proportional to the number it already has. In many cases this seems like a natural process. For example, a paper that already has many citations is more likely to be discovered during a literature search and hence more likely to be cited again. Simon [35] dubbed this type of "rich-get-richer" process the Gibrat principle. Elsewhere it also goes by the names of the Matthew effect [54], cumulative advantage [50], or preferential attachment [53].

There is a problem however when $k_0 = 0$. For example, if new papers appear with no citations and garner citations in proportion to the number they currently have, which is zero, then no paper will ever get any citations! To overcome this problem one typically assigns new citations not in proportion simply to $k$, but to $k + c$, where $c$ is some constant. Thus there are three parameters $k_0$, $c$ and $m$ that control the behaviour of the model.



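By an argument exactly analogous to the one given above, one can then derive the master equation

$$(n+1)\,p_{k,n+1} = n\,p_{k,n} + m\,\frac{k-1+c}{k_0+c+m}\,p_{k-1,n} - m\,\frac{k+c}{k_0+c+m}\,p_{k,n}, \qquad \text{for } k > k_0, \tag{72}$$

and

$$(n+1)\,p_{k_0,n+1} = n\,p_{k_0,n} + 1 - m\,\frac{k_0+c}{k_0+c+m}\,p_{k_0,n}, \qquad \text{for } k = k_0. \tag{73}$$

Here the factor $k_0 + c + m$ arises because the attachment probabilities are normalized by $\sum_i (k_i + c) = n(k_0 + m + c)$, the mean value of $k$ being $k_0 + m$. (Note that $k$ is never less than $k_0$, since each object appears with $k = k_0$ initially.)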



Looking for stationary solutions of these equations as before, we define $p_k = \lim_{n\to\infty} p_{k,n}$ and find that

$$p_{k_0} = \frac{k_0 + c + m}{(m+1)(k_0+c) + m}, \tag{74}$$

and

$$p_k = \frac{(k-1+c)(k-2+c)\cdots(k_0+c)}{(k-1+c+\alpha)(k-2+c+\alpha)\cdots(k_0+c+\alpha)}\,p_{k_0} = \frac{\Gamma(k+c)\,\Gamma(k_0+c+\alpha)}{\Gamma(k_0+c)\,\Gamma(k+c+\alpha)}\,p_{k_0}, \tag{75}$$

where I have made use of the $\Gamma$-function notation introduced for Eq. (70) and, for reasons that will become clear in just a moment, I have defined $\alpha = 2 + (k_0 + c)/m$. As before, this expression can also be written in terms of the beta-function, Eq. (20):

$$p_k = \frac{\mathrm{B}(k+c, \alpha)}{\mathrm{B}(k_0+c, \alpha)}\,p_{k_0}. \tag{76}$$

Since the beta-function follows a power law in its tail, $\mathrm{B}(a,b) \sim a^{-b}$, the general Yule process generates a power-law distribution $p_k \sim k^{-\alpha}$ with exponent related to the three parameters of the process according to

$$\alpha = 2 + \frac{k_0 + c}{m}. \tag{77}$$



For example, the original Yule process for number of species per genus has $c = 0$ and $k_0 = 1$, which reproduces the result of Eq. (71). For citations of papers or links to web pages we have $k_0 = 0$ and we must have $c > 0$ to get any citations or links at all. So $\alpha = 2 + c/m$. In his work on citations Price [50] assumed that $c = 1$, so that paper citations have the same exponent $\alpha = 2 + 1/m$ as the standard Yule process, although there doesn't seem to be any very good reason for making this assumption. As we saw in Table I (and as Price himself also reported), real citations seem to have an exponent $\alpha \simeq 3$, so we should expect $c \simeq m$. For the data from the Science Citation Index examined in Section II.A, the mean number $m$ of citations per paper is 8.6. So we should put $c \simeq 8.6$ too if we want the Yule process to match the observed exponent.

The most widely studied model of links on the web, 
that of Barabasi and Albert [s^ , assumes c = m so that 
$\alpha = 3$, but again there doesn't seem to be a good reason
for this assumption. The measured exponent for numbers
of links to web sites is about $\alpha = 2.2$, so if the Yule
process is to match the data in this case, we should put
$c \simeq 0.2m$.
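The parameter matching in the last few paragraphs is just Eq. (77) read forwards and backwards. A minimal sketch, with function names chosen here for illustration:

```python
def yule_exponent(k0, c, m):
    """Tail exponent alpha = 2 + (k0 + c)/m of the general Yule process, Eq. (77)."""
    return 2 + (k0 + c) / m


def offset_for_exponent(alpha, k0, m):
    """Invert Eq. (77): the offset c that yields a target exponent alpha."""
    return (alpha - 2) * m - k0


# Original Yule process (c = 0, k0 = 1) with m = 2, as for flowering plants.
print(yule_exponent(k0=1, c=0, m=2))  # 2.5

# Citations: k0 = 0, target alpha = 3, m = 8.6 citations per paper.
print(offset_for_exponent(3.0, k0=0, m=8.6))

# Web links: k0 = 0, target alpha = 2.2; with m = 1 the result is c in units of m.
print(offset_for_exponent(2.2, k0=0, m=1.0))
```

Setting `m=1.0` in the last call is only a convenience for expressing $c$ as a multiple of $m$.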

However, the important point is that the Yule process 
is a plausible and general mechanism that can explain a 
number of the power-law distributions observed in nature 
and can produce a wide range of exponents to match the 
observations by suitable adjustments of the parameters. 
For several of the distributions shown in Fig. 4, especially
citations, city populations and personal income, it is now 
the most widely accepted theory. 



E. Phase transitions and critical phenomena 

A completely different mechanism for generating power 
laws, one that has received a huge amount of attention 
over the past few decades from the physics community, 
is that of critical phenomena. 

Some systems have only a single macroscopic length- 
scale, size-scale or time-scale governing them. A classic 
example is a magnet, which has a correlation length that 
measures the typical size of magnetic domains. Under 
certain circumstances this length-scale can diverge, leav- 
ing the system with no scale at all. As we will now see, 
such a system is "scale-free" in the sense of Section III.E
and hence the distributions of macroscopic physical quan- 
tities have to follow power laws. Usually the circum- 
stances under which the divergence takes place are very 
specific ones. The parameters of the system have to be 
tuned very precisely to produce the power-law behaviour. 
This is something of a disadvantage; it makes the diver- 
gence of length-scales an unlikely explanation for generic 
power-law distributions of the type highlighted in this 
paper. As we will shortly see, however, there are some 
elegant and interesting ways around this problem. 

The precise point at which the length-scale in a sys- 
tem diverges is called a critical point or a phase transi- 
tion. More specifically it is a continuous phase transi- 
tion. (There are other kinds of phase transitions too.) 
Things that happen in the vicinity of continuous phase 
transitions are known as critical phenomena, of which 
power-law distributions are one example. 



FIG. 10 The percolation model on a square lattice: squares 
on the lattice are coloured in independently at random with 
some probability $p$. In this example $p = \tfrac{1}{2}$.



To better understand the physics of critical phenomena, let us explore one simple but instructive example, that of the "percolation transition". Consider a square lattice like the one depicted in Fig. 10 in which some of the squares have been coloured in. Suppose we colour each square with independent probability $p$, so that on average a fraction $p$ of them are coloured in. Now we look at the clusters of coloured squares that form, i.e., the contiguous regions of adjacent coloured squares. We can ask, for instance, what the mean area $\langle s \rangle$ is of the cluster to which a randomly chosen square belongs. If that square is not coloured in then the area is zero. If it is coloured in but none of the adjacent ones is coloured in then the area is one, and so forth.

When $p$ is small, only a few squares are coloured in and most coloured squares will be alone on the lattice, or maybe grouped in twos or threes. So $\langle s \rangle$ will be small. This situation is depicted in Fig. 11 for $p = 0.3$. Conversely, if $p$ is large (almost 1, which is the largest value it can have) then most squares will be coloured in and they will almost all be connected together in one large cluster, the so-called spanning cluster. In this situation we say that the system percolates. Now the mean size of the cluster to which a square belongs is limited only by the size of the lattice itself and as we let the lattice size become large $\langle s \rangle$ also becomes large. So we have two distinctly different behaviours, one for small $p$ in which $\langle s \rangle$ is small and doesn't depend on the size of the system, and one for large $p$ in which $\langle s \rangle$ is much larger and increases with the size of the system.

FIG. 11 Three examples of percolation systems on 100 × 100 square lattices with $p = 0.3$, $p = p_c = 0.5927\ldots$ and $p = 0.9$. The first and last are well below and above the critical point respectively, while the middle example is precisely at it.

And what happens in between these two extremes? As we increase $p$ from small values, the value of $\langle s \rangle$ also increases. But at some point we reach the start of the regime in which $\langle s \rangle$ goes up with system size instead of staying constant. We now know that this point is at $p = 0.5927462\ldots$, which is called the critical value of $p$ and is denoted $p_c$. If the size of the lattice is large, then $\langle s \rangle$ also becomes large at this point, and in the limit where the lattice size goes to infinity $\langle s \rangle$ actually diverges. To illustrate this phenomenon, I show in Fig. 12 a plot of $\langle s \rangle$ from simulations of the percolation model and the divergence is clear.
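A small-scale version of such a simulation is easy to write. The sketch below is not the code used for the figures; it is a minimal illustration that fills an $L \times L$ lattice at random, finds clusters by breadth-first search with nearest-neighbour adjacency, and computes $\langle s \rangle$ as defined above. Because a square in a cluster of area $s$ is chosen with probability $s/L^2$ and empty squares count as zero, $\langle s \rangle = \sum s^2 / L^2$, summed over clusters.

```python
import random
from collections import deque


def cluster_sizes(grid):
    """Sizes of all clusters of occupied squares (4-neighbour adjacency)."""
    size = len(grid)
    seen = [[False] * size for _ in range(size)]
    sizes = []
    for i in range(size):
        for j in range(size):
            if grid[i][j] and not seen[i][j]:
                seen[i][j] = True
                queue = deque([(i, j)])
                count = 0
                while queue:
                    x, y = queue.popleft()
                    count += 1
                    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                        if (0 <= nx < size and 0 <= ny < size
                                and grid[nx][ny] and not seen[nx][ny]):
                            seen[nx][ny] = True
                            queue.append((nx, ny))
                sizes.append(count)
    return sizes


def mean_cluster_area(size, p, rng):
    """Mean area <s> of the cluster containing a randomly chosen square."""
    grid = [[rng.random() < p for _ in range(size)] for _ in range(size)]
    return sum(s * s for s in cluster_sizes(grid)) / (size * size)


rng = random.Random(1)
for p in (0.3, 0.5, 0.5927, 0.7, 0.9):
    print(p, round(mean_cluster_area(100, p, rng), 1))
```

On a single 100 × 100 lattice the values fluctuate from run to run, so averaging over many lattices (as in the figure) is needed for a smooth curve.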

FIG. 12 The mean area of the cluster to which a randomly chosen square belongs for the percolation model described in the text, calculated from an average over 1000 simulations on a 1000 × 1000 square lattice. The dotted line marks the known position of the phase transition.

Now consider not just the mean cluster size but the entire distribution of cluster sizes. Let $p(s)$ be the probability that a randomly chosen square belongs to a cluster of area $s$. In general, what forms can $p(s)$ take as a function of $s$? The important point to notice is that $p(s)$, being a probability distribution, is a dimensionless quantity (just a number) but $s$ is an area. We could measure $s$ in terms of square metres, or whatever units the lattice is calibrated in. The average $\langle s \rangle$ is also an area and then there is the area of a unit square itself, which we will denote $a$. Other than these three quantities, however, there are no other independent parameters with dimensions in this problem. (There is the area of the whole lattice, but we are considering the limit where that becomes infinite, so it's out of the picture.)

If we want to make a dimensionless function $p(s)$ out of these three dimensionful parameters, there are three dimensionless ratios we can form: $s/a$, $a/\langle s \rangle$ and $s/\langle s \rangle$ (or their reciprocals, if we prefer). Only two of these are independent however, since the last is the product of the other two. Thus in general we can write

$$p(s) = C f\!\left(\frac{s}{a}, \frac{a}{\langle s \rangle}\right), \tag{78}$$



where $f$ is a dimensionless mathematical function of its dimensionless arguments and $C$ is a normalizing constant chosen so that $\sum_s p(s) = 1$.

But now here's the trick. We can coarse-grain or 
rescale our lattice so that the fundamental unit of the 
lattice changes. For instance, we could double the size of 
our unit square a. The kind of picture I'm thinking of 
is shown in Fig. 13. The basic percolation clusters stay
roughly the same size and shape, although I've had to 
fudge things around the edges a bit to make it work. For 
this reason this argument will only be strictly correct for 
large clusters s whose area is not changed appreciably by 
the fudging. (And the argument thus only tells us that 
the tail of the distribution is a power law, and not the 
whole distribution.)

FIG. 13 A site percolation system is coarse-grained, so that the area of the fundamental square is (in this case) quadrupled. The occupation of the squares in the coarse-grained lattice (right) is chosen to mirror as nearly as possible that of the squares on the original lattice (left), so that the sizes and shapes of the large clusters remain roughly the same. The small clusters are mostly lost in the coarse-graining, so that the arguments given in the text are valid only for the large-$s$ tail of the cluster size distribution.

The probability $p(s)$ of getting a cluster of area $s$ is unchanged by the coarse-graining since the areas themselves are, to a good approximation, unchanged, and the mean cluster size is thus also unchanged. All that has changed, mathematically speaking, is that the unit area $a$ has been rescaled $a \to a/b$ for some constant rescaling factor $b$. The equivalent of Eq. (78) in our coarse-grained system is

$$p(s) = C' f\!\left(\frac{s}{a/b}, \frac{a/b}{\langle s \rangle}\right) = C' f\!\left(\frac{bs}{a}, \frac{a}{b\langle s \rangle}\right). \tag{79}$$
Comparing with Eq. (78), we can see that this is equal, to within a multiplicative constant, to the probability $p(bs)$ of getting a cluster of size $bs$, but in a system with a different mean cluster size of $b\langle s \rangle$. Thus we have related the probabilities of two different sizes of clusters to one another, but on systems with different average cluster size and hence presumably also different site occupation probability. Note that the normalization constant must in general be changed in Eq. (79) to make sure that $p(s)$ still sums to unity, and that this change will depend on the value we choose for the rescaling factor $b$.

But now we notice that there is one special point at which this rescaling by definition does not result in a change in $\langle s \rangle$ or a corresponding change in the site occupation probability, and that is the critical point. When we are precisely at the point at which $\langle s \rangle \to \infty$, then $b\langle s \rangle = \langle s \rangle$ by definition. Putting $\langle s \rangle \to \infty$ in Eqs. (78) and (79), we then get $p(s) = C' f(bs/a, 0) = (C'/C)\,p(bs)$. Or equivalently

$$p(bs) = g(b)\,p(s), \tag{80}$$

where $g(b) = C/C'$. Comparing with Eq. (29) we see that this has precisely the form of the equation that defines a scale-free distribution. The rest of the derivation below Eq. (29) follows immediately, and so we know that $p(s)$ must follow a power law.
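For convenience, the step from Eq. (80) to a power law can be sketched as follows. This is the standard argument for a function satisfying such a scaling relation, written out under the assumptions that $s$ is treated as a continuous variable measured in units where $s = 1$ is meaningful and that $p$ is differentiable; it is a reminder of the reasoning rather than a quotation of the earlier section.

```latex
\begin{align*}
p(bs) &= g(b)\,p(s)
  && \text{Eq.~(80)} \\
p(b) &= g(b)\,p(1) \;\Rightarrow\; g(b) = \frac{p(b)}{p(1)}
  && \text{set } s = 1 \\
s\,p'(bs) &= \frac{p'(b)}{p(1)}\,p(s)
  && \text{differentiate with respect to } b \\
s\,\frac{dp}{ds} &= \frac{p'(1)}{p(1)}\,p(s)
  && \text{set } b = 1 \\
\ln p(s) &= \frac{p'(1)}{p(1)}\,\ln s + \ln p(1)
  && \text{integrate, fixing the constant at } s = 1 \\
p(s) &= p(1)\,s^{-\alpha}, \qquad \alpha = -\frac{p'(1)}{p(1)} .
\end{align*}
```

The exponent $\alpha$ is thus fixed by the behaviour of $p$ at a single reference point, and no other functional form is compatible with the scaling relation.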

This in fact is the origin of the name "scale-free" for a distribution of the form (29). At the point at which $\langle s \rangle$ diverges, the system is left with no defining size-scale, other than the unit of area $a$ itself. It is "scale-free", and by the argument above it follows that the distribution of $s$ must obey a power law.

In Fig. 14 I show an example of a cumulative distribution of cluster sizes for a percolation system right at the critical point and, as the figure shows, the distribution does indeed follow a power law. Technically the distribution cannot follow a power law to arbitrarily large cluster sizes since the area of a cluster can be no bigger than the area of the whole lattice, so the power-law distribution will be cut off in the tail. This is an example of a finite-size effect. This point does not seem to be visible in Fig. 14 however.

FIG. 14 Cumulative distribution of the sizes of clusters for (site) percolation on a square lattice of 40 000 × 40 000 sites at the critical site occupation probability $p_c = 0.592746\ldots$

The kinds of arguments given in this section can be made more precise using the machinery of the renormalization group. The real-space renormalization group makes use precisely of transformations such as that shown in Fig. 13 to derive power-law forms and their exponents for distributions at the critical point. An example application to the percolation problem is given by Reynolds et al. [55]. A more technically sophisticated technique is the k-space renormalization group, which makes use of transformations in Fourier space to accomplish similar aims in a particularly elegant formal environment [56].

F. Self-organized criticality 

As discussed in the preceding section, certain sys- 
tems develop power-law distributions at special "critical" 
points in their parameter space because of the divergence 
of some characteristic scale, such as the mean cluster size 
in the percolation model. This does not, however, pro- 
vide a plausible explanation for the origin of power laws 
in most real systems. Even if we could come up with some 
model of earthquakes or solar flares or web hits that had 
such a divergence, it seems unlikely that the parameters 
of the real world would, just coincidentally, fall precisely 
at the point where the divergence occurred. 

As first proposed by Bak et al. [57], however, it is possible that some dynamical systems actually arrange themselves so that they always sit at the critical point, no matter what state we start off in. One says that such systems self-organize to the critical point, or that they display self-organized criticality. A now-classic example of such a system is the forest fire model of Drossel and Schwabl [58], which is based on the percolation model we have already seen.

Consider the percolation model as a primitive model of a forest. The lattice represents the landscape and a single tree can grow in each square. Occupied squares represent trees and empty squares represent empty plots of land with no trees. Trees appear instantaneously at random at some constant rate and hence the squares of the lattice fill up at random. Every once in a while a wildfire starts at a random square on the lattice, set off by a lightning strike perhaps, and burns the tree in that square, if there is one, along with every other tree in the cluster connected to it. The process is illustrated in Fig. 15. One can think of the fire as leaping from tree to adjacent tree until the whole cluster is burned, but the fire cannot cross the firebreak formed by an empty square. If there is no tree in the square struck by the lightning, then nothing happens. After a fire, trees can grow up again in the squares vacated by burnt trees, so the process keeps going indefinitely.

FIG. 15 Lightning strikes at random positions in the forest fire model, starting fires that wipe out the entire cluster to which a struck tree belongs.

If we start with an empty lattice, trees will start to ap- 
pear but will initially be sparse and lightning strikes will 
either hit empty squares or if they do chance upon a tree 
they will burn it and its cluster, but that cluster will be 
small and localized because we are well below the perco- 
lation threshold. Thus fires will have essentially no effect 
on the forest. As time goes by however, more and more 
trees will grow up until at some point there are enough 
that we have percolation. At that point, as we have seen, 
a spanning cluster forms whose size is limited only by the 
size of the lattice, and when any tree in that cluster gets 
hit by the lightning the entire cluster will burn away. 
This gets rid of the spanning cluster so that the system 
does not percolate any more, but over time as more trees 
appear it will presumably reach percolation again, and so 
the scenario will play out repeatedly. The end result is 
that the system oscillates right around the critical point, 
first going just above the percolation threshold as trees 
appear and then being beaten back below it by fire. In 
the limit of large system size these fluctuations become 
small compared to the size of the system as a whole and 
to an excellent approximation the system just sits at the 
threshold indefinitely. Thus, if we wait long enough, we 
expect the forest fire model to self-organize to a state 
in which it has a power-law distribution of the sizes of 
clusters, or of the sizes of fires. 
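The dynamics just described can be sketched in a few lines. The version below is a simplified illustration, not the simulation behind the figures: time is discretized so that a fixed number of planting attempts (`growth_per_strike`, an assumed parameter setting the ratio of growth rate to lightning rate) happens between successive lightning strikes, and the lattice is kept small so that it runs quickly.

```python
import random


def burn_cluster(forest, size, start):
    """Remove the cluster of trees containing `start`; return its size."""
    stack = [start]
    forest[start[0]][start[1]] = False
    burned = 0
    while stack:
        x, y = stack.pop()
        burned += 1
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < size and 0 <= ny < size and forest[nx][ny]:
                forest[nx][ny] = False  # fire spreads to an adjacent tree
                stack.append((nx, ny))
    return burned


def forest_fire(size, steps, growth_per_strike, seed=0):
    """Alternate tree growth and lightning; return the list of fire sizes."""
    rng = random.Random(seed)
    forest = [[False] * size for _ in range(size)]
    fires = []
    for _ in range(steps):
        # Growth: planting attempts at random squares (occupied ones unchanged).
        for _ in range(growth_per_strike):
            forest[rng.randrange(size)][rng.randrange(size)] = True
        # Lightning: one random square; nothing happens if it is empty.
        x, y = rng.randrange(size), rng.randrange(size)
        if forest[x][y]:
            fires.append(burn_cluster(forest, size, (x, y)))
    return fires


fires = forest_fire(size=100, steps=5000, growth_per_strike=50)
print(len(fires), max(fires), sum(fires) / len(fires))
```

A rank/frequency plot of `fires` on logarithmic axes is the natural next step, though a lattice this small will show a pronounced finite-size cutoff.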

In Fig. 16 I show the cumulative distribution of the sizes of fires in the forest fire model and, as we can see, it follows a power law closely. The exponent of the distribution is quite small in this case. The best current estimates give a value of $\alpha = 1.19 \pm 0.01$ [59], meaning that the distribution has an infinite mean in the limit of large system size. For all real systems however the mean is finite: the distribution is cut off in the large-size tail because fires cannot have a size any greater than that of the lattice as a whole and this makes the mean well-behaved. This cutoff is clearly visible in Fig. 16 as the drop in the curve towards the right of the plot. What's more, the distribution of the sizes of fires in real forests, Fig. 5, shows a similar cutoff and is in many ways qualitatively similar to the distribution predicted by the model. (Real forests are obviously vastly more complex than the forest fire model, and no one is seriously suggesting that the model is an accurate representation of the real world. Rather it is a guide to the general type of processes that might be going on in forests.)

FIG. 16 Cumulative distribution of the sizes of "fires" in a simulation of the forest fire model of Drossel and Schwabl [58] for a square lattice of size 5000 × 5000.

There has been much excitement about self-organized 
criticality as a possible generic mechanism for explaining 
where power-law distributions come from. Per Bak, one 
of the originators of the idea, wrote an entire book about 
it [60]. Self-organized critical models have been put for-
ward not only for forest fires, but for earthquakes [61, 62],
solar flares [5], biological evolution [63], avalanches [57]
and many other phenomena. Although it is probably not
the universal law that some have claimed it to be, it is cer- 
tainly a powerful and intriguing concept that potentially 
has applications to a variety of natural and man-made 
systems.

G. Other mechanisms for generating power laws

In the preceding sections I've described the best 
known and most widely applied mechanisms that gener- 
ate power-law distributions. However, there are a num- 
ber of others that deserve a mention. One that has been 
receiving some attention recently is the highly optimized 
tolerance mechanism of Carlson and Doyle [64, 65]. The
classic example of this mechanism is again a model of 
forest fires and is based on the percolation process. Suppose
again that fires start at random in a grid-like forest,
just as we considered in Sec. IV.F, but suppose now that
instead of appearing at random, trees are deliberately 
planted by a knowledgeable forester. One can ask what 
the best distribution of trees is to optimize the amount of 
lumber the forest produces, subject to random fires that 
could start at any place. The answer turns out to be that 
one should plant trees in blocks, with narrow firebreaks 
between them to prevent fires from spreading. Moreover, 
one should make the blocks smaller in regions where fires 
start more often and larger where fires are rare. The 
reason for this is that we waste some valuable space by 
making firebreaks, space in which we could have planted 
more trees. If fires are rare, then on average it pays to put 
the breaks further apart — more trees will burn if there is 
a fire, but we also get more lumber if there isn't. 
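
The trade-off between firebreaks and burned trees can be illustrated with a deliberately crude one-dimensional toy calculation. This is not the model of Carlson and Doyle: the function names, the region length, the firebreak width and the rule that a fire burns exactly one block are all assumptions made for this sketch.

```python
import numpy as np

def expected_yield(block, fire_prob, region=1000.0, gap=1.0):
    """Expected surviving lumber in a 1D toy forest (illustrative assumptions).

    A region of length `region` is divided into blocks of `block` trees,
    each followed by a firebreak of width `gap`. With probability
    `fire_prob` one fire starts and burns exactly one block.
    """
    planted = region * block / (block + gap)   # space not lost to firebreaks
    burned = fire_prob * block                 # expected trees lost to fire
    return planted - burned

def best_block_size(fire_prob, region=1000.0, gap=1.0):
    """Grid search for the block size with the highest expected yield."""
    sizes = np.linspace(1.0, region, 100_000)
    return sizes[np.argmax(expected_yield(sizes, fire_prob, region, gap))]

for q in (0.9, 0.1, 0.01):
    # Setting d(yield)/d(block) = 0 gives block = sqrt(region * gap / q) - gap.
    analytic = np.sqrt(1000.0 * 1.0 / q) - 1.0
    print(f"fire probability {q}: best block ~ {best_block_size(q):.1f} "
          f"(analytic {analytic:.1f})")
```

In this toy version the best block size scales as the inverse square root of the fire probability, so blocks come out smaller where fires are frequent and larger where they are rare, in line with the argument above.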

Carlson and Doyle show both by analytic arguments 
and by numerical simulation that for quite general dis- 
tributions of starting points for fires this process leads to 
a distribution of fire sizes that approximately follows a 
power law. The distribution is not a perfect power law 
in this case, but on the other hand neither are many of 
those seen in the data of Fig. 01 so this is not necessarily 
a disadvantage. Carlson and Doyle have proposed that 
highly optimized tolerance could be a model not only for 
forest fires but also for the sizes of files on the world wide 
web, which appear to follow a power law.

Another mechanism, which is mathematically similar 
to that of Carlson and Doyle but quite different in mo- 
tivation, is the coherent noise mechanism proposed by 
Sneppen and Newman [66] as a model of biological extinction.
In this mechanism a number of agents or species
are subjected to stresses of various sizes, and each agent 
has a threshold for stress above which an applied stress 
will wipe that agent out — the species becomes extinct. 
Extinct species are replaced by new ones with randomly 
chosen thresholds. The net result is that the system self- 
organizes to a state where most of the surviving species 
have high thresholds, but the exact distribution depends 
on the distribution of stresses in a way very similar to the 
relation between block sizes and fire frequency in highly 
optimized tolerance. No conscious optimization is needed 
in this case, but the end result is similar: the overall dis- 
tribution of the numbers of species becoming extinct as 
a result of any particular stress approximately follows a 
power law. The power-law form is not exact, but it's as 
good as that seen in real extinction data. Sneppen and 
Newman have also suggested that their mechanism could
be used to model avalanches and earthquakes.
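
The mechanism just described is simple enough to simulate in a few lines. The sketch below is one possible reading of it, not a definitive implementation: the uniform thresholds, the exponentially distributed stresses, the parameter values and the extra step in which a small random fraction of agents receives fresh thresholds are all assumptions made here.

```python
import numpy as np

def coherent_noise(n_agents=2000, steps=20000, stress_scale=0.05,
                   reshuffle=1e-3, seed=0):
    """Return the number of agents wiped out at each time step."""
    rng = np.random.default_rng(seed)
    thresholds = rng.random(n_agents)            # thresholds uniform on [0, 1)
    sizes = np.empty(steps, dtype=int)
    for t in range(steps):
        stress = rng.exponential(stress_scale)   # one stress hits every agent
        dead = thresholds < stress
        sizes[t] = dead.sum()
        thresholds[dead] = rng.random(sizes[t])  # replace the extinct agents
        # Assumed extra step: a small random fraction gets fresh thresholds,
        # so that low-threshold agents never disappear altogether.
        shuffled = rng.random(n_agents) < reshuffle
        thresholds[shuffled] = rng.random(shuffled.sum())
    return sizes

sizes = coherent_noise()
events = sizes[sizes > 0]
print("largest event:", events.max(), "median event:", int(np.median(events)))
```

Plotting a histogram of `events` on logarithmic axes is the natural way to check how close the result is to a power law for a given choice of stress distribution.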

One of the broad distributions mentioned in Sec. IlLBI 
as an alternative to the power law was the log-normal. A 
log-normally distributed quantity is one whose logarithm 
is normally distributed. That is 



$$ p(\ln x) \sim \exp\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right), \tag{81} $$



for some choice of the mean $\mu$ and standard deviation $\sigma$
of the distribution. Distributions like this typically arise 
when we are multiplying together random numbers. The 
log of the product of a large number of random numbers is 
the sum of the logarithms of those same random numbers, 
and by the central limit theorem such sums have a normal 
distribution essentially regardless of the distribution of 
the individual numbers. 

But Eq. (81) implies that the distribution of $x$ itself is

$$ p(x) = p(\ln x)\,\frac{d\ln x}{dx} = \frac{1}{x}\exp\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right). \tag{82} $$



To see how this looks if we were to plot it on log scales, 
we take logarithms of both sides, giving 



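$$ \ln p(x) = -\ln x - \frac{(\ln x - \mu)^2}{2\sigma^2} = -\frac{(\ln x)^2}{2\sigma^2} + \left[ \frac{\mu}{\sigma^2} - 1 \right] \ln x - \frac{\mu^2}{2\sigma^2}, \tag{83} $$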



which is quadratic in $\ln x$. However, any quadratic curve
looks straight if we view a sufficiently small portion of it, so
$p(x)$ will look like a power-law distribution when we look
at a small portion on log scales. The effective exponent $\alpha$
of the distribution is in this case not fixed by the theory:
it could be anything, depending on which part of the
quadratic our data fall on.
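
The local slope can be made explicit by differentiating the logarithm of the log-normal form, $\ln p(x) = -\ln x - (\ln x - \mu)^2/2\sigma^2$, with respect to $\ln x$. The label $\alpha_{\rm eff}$ is introduced here for illustration only.

$$ \alpha_{\rm eff}(x) \equiv -\frac{d \ln p(x)}{d \ln x} = 1 + \frac{\ln x - \mu}{\sigma^2}. $$

The apparent exponent thus depends on where the data sit relative to $\mu$, and across a window of width $\Delta$ in $\ln x$ it drifts by $\Delta/\sigma^2$.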

On larger scales the distribution will have some down- 
ward curvature, but so do many of the distributions 
claimed to follow power laws, so it is possible that these 
distributions are really log-normal. In fact, in many cases 
we don't even have to restrict ourselves to a particularly
small portion of the curve. If $\sigma$ is large then the
quadratic term in Eq. (83) will vary slowly and the curvature
of the line will be slight, so the distribution will
appear to follow a power law over relatively large por- 
tions of its range. This situation arises commonly when 
we are considering products of random numbers. 

Suppose for example that we are multiplying together 
100 numbers, each of which is drawn from some distri- 
bution such that the standard deviation of the logs is 
around 1 — i.e., the numbers themselves vary up or down 
by about a factor of $e$. Then, by the central limit theorem,
the standard deviation for $\ln x$ will be $\sigma \simeq 10$
and $\ln x$ will have to vary by about $\pm 10$ for changes in
$(\ln x)^2/\sigma^2$ to be apparent. But such a variation in the
logarithm corresponds to a variation in x of more than 
four orders of magnitude. If our data span a domain 
smaller than this, as many of the plots in Fig.^do, then 
we will see a measured distribution that looks close to
power-law. And the range will get quickly larger as the 
number of numbers we are multiplying grows. 
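
The numbers in this example are easy to check numerically. The sketch below multiplies together 100 random factors whose logarithms have standard deviation 1; the choice of a uniform distribution for the logarithms, the sample size and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_factors, n_samples = 100, 50_000

# Logs of the factors are uniform on [-sqrt(3), sqrt(3)]: mean 0, std 1.
log_factors = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n_samples, n_factors))
log_product = log_factors.sum(axis=1)      # ln of the product of 100 factors

sigma = log_product.std()
print(f"std of ln x: {sigma:.2f}")         # should come out close to 10

# Excess kurtosis is 0 for a normal distribution.
z = (log_product - log_product.mean()) / sigma
excess_kurtosis = np.mean(z**4) - 3.0
print(f"excess kurtosis of ln x: {excess_kurtosis:.3f}")
```

Although the individual logarithms are far from normally distributed, their sum has a standard deviation near $\sqrt{100} = 10$ and an excess kurtosis near zero, as the central limit theorem leads us to expect.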

One example of a random multiplicative process might 
be wealth generation by investment. If a person invests 
money, for instance in the stock market, they will get 
a percentage return on their investment that varies over 
time. In other words, in each period of time their in- 
vestment is multiplied by some factor which fluctuates 
from one period to the next. If the fluctuations are ran- 
dom and uncorrelated, then after many such periods the 
value of the investment is the initial value multiplied by 
the product of a large number of random numbers, and 
therefore should be distributed according to a log-normal.
This could explain why the tail of the wealth distribution,
Fig. EI, appears to follow a power law. 

Another example is fragmentation. Suppose we break 
a stick of unit length into two parts at a position which is 
a random fraction z of the way along the stick's length. 
Then we break the resulting pieces at random again and 
so on. After many breaks, the length of one of the remaining
pieces will be $\prod_i z_i$, where $z_i$ is the position of
the $i$th break. This is a product of random numbers and
thus the resulting distribution of lengths should follow a 
power law over a portion of its range. A mechanism like 
this could, for instance, produce a power-law distribution 
of meteors or other interplanetary rock fragments, which 
tend to break up when they collide with one another, and 
this in turn could produce a power-law distribution of the
sizes of meteor craters similar to the one in Fig. Bb. 
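
The fragmentation process can be sketched as follows. For simplicity the code follows a single piece through repeated breaks rather than keeping track of every fragment, and it takes the break positions to be uniformly distributed; both choices, along with the variable names, are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_breaks, n_pieces = 50, 100_000

# z[k, i] is the fraction kept at the i-th break of the k-th piece,
# drawn uniformly from (0, 1].
z = 1.0 - rng.random((n_pieces, n_breaks))
log_length = np.log(z).sum(axis=1)   # ln of prod_i z_i for a stick of length 1

print(f"mean of ln(length): {log_length.mean():.2f}")
print(f"std of ln(length):  {log_length.std():.2f}")
```

For uniform $z_i$ each factor contributes mean $-1$ and variance 1 to the logarithm, so after 50 breaks the log of the length is spread over a range of about $\sqrt{50} \simeq 7$ on either side of its mean, which corresponds to several orders of magnitude in the length itself.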

In fact, as discussed by a number of authors IbM 
l69| . random multiplication processes can also generate 
perfect power-law distributions with only a slight modi- 
fication: if there is a lower bound on the value that the 
product of a set of numbers is allowed to take (for ex- 
ample if there is a "reflecting boundary" on the lower 
end of the range, or an additive noise term as well as a 
multiplicative one) then the behaviour of the process is 
modified to generate not a log-normal, but a true power 
law. 
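
A minimal numerical sketch of this modification is given below. It assumes log-normally distributed factors with a negative mean logarithm and implements the lower bound by simply resetting any value that falls below it, which is only one of several ways such a bound could be imposed. All names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_walkers, n_steps = 20_000, 2_000
drift, spread = -0.1, 0.5      # mean and std of ln(factor); drift must be < 0

log_x = np.zeros(n_walkers)    # ln(x / x_min), so the lower bound sits at 0
for _ in range(n_steps):
    log_x += rng.normal(drift, spread, n_walkers)   # multiply by a random factor
    np.maximum(log_x, 0.0, out=log_x)               # enforce the lower bound

# If P(X > x) ~ x^(-kappa), then ln x is exponentially distributed with
# rate kappa in the tail, so kappa can be estimated from the mean excess.
tail = log_x[log_x > 2.0] - 2.0
kappa_hat = 1.0 / tail.mean()
kappa_theory = -2.0 * drift / spread**2   # solves E[factor**kappa] = 1 here
print(f"estimated kappa {kappa_hat:.2f}, expected about {kappa_theory:.2f}")
```

Here `kappa` is the exponent of the cumulative distribution, so the exponent of the probability density is $\alpha = 1 + \kappa$. Without the lower bound the same process would simply drift towards zero and remain log-normal.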

Finally, some processes show power-law distributions 
of times between events. The distribution of times be- 
tween earthquakes and their aftershocks is one exam- 
ple. Such power-law distributions of times are observed 
in critical models and in the coherent noise mechanism 
mentioned above, but another possible explanation for 
their occurrence is a random extremal process or record 
dynamics. In this mechanism we consider how often a 
randomly fluctuating quantity will break its own record 
for the highest value recorded. For a quantity with, say, a 
Gaussian distribution, it is always in theory possible for 
the record to be broken, no matter what its current value, 
but the more often the record is broken the higher the 
record will get and the longer we will have to wait until it 
is broken again. As shown by Sibani and Littlewood [t^ , 
this non-stationary process gives a distribution of wait- 
ing times between the establishment of new records that 
follows a power law with exponent $\alpha = 1$. Interestingly,
this is precisely the exponent observed for the distribution
of waiting times for aftershocks of earthquakes. The
record dynamics has also been proposed as a model for 
the lifetimes of biological taxa (tJ . 
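
The slowing down of record-breaking is easy to see in a simulation. For independent, identically distributed continuous values the chance that step $t$ sets a new record is $1/t$, so the expected number of records in $n$ steps is the harmonic sum $\sum_{t=1}^{n} 1/t \simeq \ln n$. The sketch below checks this for Gaussian values; the number of runs, the run length and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_runs, n_steps = 500, 5_000

x = rng.normal(size=(n_runs, n_steps))
running_max = np.maximum.accumulate(x, axis=1)

# A record occurs at step t when x[t] exceeds every earlier value.
is_record = np.ones_like(x, dtype=bool)           # the first value is a record
is_record[:, 1:] = x[:, 1:] > running_max[:, :-1]

mean_records = is_record.sum(axis=1).mean()
harmonic = np.sum(1.0 / np.arange(1, n_steps + 1))   # sum of the 1/t rates
print(f"mean number of records: {mean_records:.2f}, harmonic sum: {harmonic:.2f}")
```

The waiting times themselves can be obtained from the gaps between successive `True` entries in each row of `is_record`.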



V. CONCLUSIONS 

In this review I have discussed the power-law statis- 
tical distributions seen in a wide variety of natural and 
man-made phenomena, from earthquakes and solar flares 
to populations of cities and sales of books. We have seen 
many examples of power-law distributions in real data 
and seen how to analyse those data to understand the be- 
haviour and parameters of the distributions. I have also 
described a number of physical mechanisms that have 
been proposed to explain the occurrence of power laws. 
Perhaps the two most important of these are: 

1. The Yule process, a rich-get-richer mechanism in
which the most populous cities or best-selling books 
get more inhabitants or sales in proportion to the 
number they already have. Yule and later Simon 
showed mathematically that this mechanism pro- 
duces what is now called the Yule distribution, 
which follows a power law in its tail. 

2. Critical phenomena and the associated concept of 
self-organized criticality, in which a scale-factor of a 
system diverges, either because we have tuned the 
system to a special critical point in its parameter 
space or because the system automatically drives it- 
self to that point by some dynamical process. The 
divergence can leave the system with no appropri- 
ate scale factor to set the size of some measured 
quantity and as we have seen the quantity must 
then follow a power law. 

The study of power-law distributions is an area in 
which there is considerable current research interest. 
While the mechanisms and explanations presented here 
certainly offer some insight, there is much work to be 
done both experimentally and theoretically before we can 
say we really understand the physical processes driving 
these systems. Without doubt there are many exciting 
discoveries still waiting to be made. 
APPENDIX A: Rank/frequency plots

Suppose we wish to make a plot of the cumulative distribution
function $P(x)$ of a quantity such as, for example,
the frequency with which words appear in a body
of text (Fig. 2^). We start by making a list of all the 
words along with their frequency of occurrence. Now the 
cumulative distribution of the frequency is defined such 
that $P(x)$ is the fraction of words with frequency greater
than or equal to $x$. Alternatively, one could simply
plot the number of words with frequency greater than
or equal to $x$, which differs from the fraction only in its
normalization.

Now consider the most frequent word, which is "the" 
in most written English texts. If x is the frequency with 
which this word occurs, then clearly there is exactly one 
word with frequency greater than or equal to x, since no 
other word is more frequent. Similarly, for the frequency 
of the second most common word — usually "of" — there 
are two words with that frequency or greater, namely 
"of" and "the". And so forth. In other words, if we 
rank the words in order, then by definition there are 
n words with frequency greater than or equal to that 
of the nth most common word. Thus the cumulative 
distribution $P(x)$ is simply proportional to the rank $n$
of a word. This means that to make a plot of $P(x)$
all we need do is sort the words in decreasing order 
of frequency, number them starting from 1, and then 
plot their ranks as a function of their frequency. Such 
a plot of rank against frequency was called by Zipf 
a rank/frequency plot, and this name is still sometimes
used to refer to plots of the cumulative distribution of a 
quantity. Of course, many quantities we are interested in 
are not frequencies — they are the sizes of earthquakes or 
people's personal wealth or whatever — but nonetheless 
people still talk about "rank/frequency" plots although 
the name is not technically accurate. 

In practice, sorting and ranking measurements and 
then plotting rank against those measurements is usu- 
ally the quickest way to construct a plot of the cumula- 
tive distribution of a quantity. All the cumulative plots 
in this paper were made in this way, except for the plot 
of the sizes of moon craters in Fig. 0^, for which the data 
came already in cumulative form. 
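
The sort-and-rank recipe can be sketched in a few lines. The function name and the toy sentence are illustrative choices, not part of the analysis in this paper.

```python
from collections import Counter

def rank_frequency(values):
    """Return (sorted_values, ranks): values in decreasing order, ranks from 1."""
    ordered = sorted(values, reverse=True)
    ranks = list(range(1, len(ordered) + 1))
    return ordered, ranks

text = "the cat and the dog and the bird"
counts = Counter(text.split())               # word -> frequency of occurrence
freqs, ranks = rank_frequency(counts.values())
print(list(zip(freqs, ranks)))
# prints [(3, 1), (2, 2), (1, 3), (1, 4), (1, 5)]
```

Plotting `ranks` against `freqs` on logarithmic axes then gives the unnormalized cumulative distribution. Where several values are tied, it is the largest rank in the tied group that equals the number of values greater than or equal to that value.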



APPENDIX B: Maximum likelihood estimate of exponents 

Consider the power-law distribution

$$ p(x) = C x^{-\alpha} = \frac{\alpha-1}{x_{\min}} \left( \frac{x}{x_{\min}} \right)^{-\alpha}, \tag{B1} $$

where we have made use of the value of the normalization
constant $C$ calculated in Eq. 0.

Given a set of $n$ values $x_i$, the probability that those
values were generated from this distribution is proportional to

$$ P(x|\alpha) = \prod_{i=1}^{n} p(x_i) = \prod_{i=1}^{n} \frac{\alpha-1}{x_{\min}} \left( \frac{x_i}{x_{\min}} \right)^{-\alpha}. \tag{B2} $$

This quantity is called the likelihood of the data set.

To find the value of $\alpha$ that best fits the data, we need
to calculate the probability $P(\alpha|x)$ of a particular value
of $\alpha$ given the observed $\{x_i\}$, which is related to $P(x|\alpha)$
by Bayes' law thus:

$$ P(\alpha|x) = P(x|\alpha)\,\frac{P(\alpha)}{P(x)}. \tag{B3} $$

The prior probability of the data $P(x)$ is fixed since $x$
itself is fixed ($x$ is equal to the particular set of observations
we actually made and does not vary in the calculation),
and it is usually assumed, in the absence of
any information to the contrary, that the prior probability
of the exponent $P(\alpha)$ is uniform, i.e., a constant
independent of $\alpha$. Thus $P(\alpha|x) \propto P(x|\alpha)$. For convenience
we typically work with the logarithm of $P(\alpha|x)$,
which, to within an additive constant, is equal to the log
of the likelihood, denoted $\mathcal{L}$ and given by

$$ \mathcal{L} = \ln P(x|\alpha) = \sum_{i=1}^{n} \left[ \ln(\alpha-1) - \ln x_{\min} - \alpha \ln\frac{x_i}{x_{\min}} \right] = n\ln(\alpha-1) - n\ln x_{\min} - \alpha \sum_{i=1}^{n} \ln\frac{x_i}{x_{\min}}. \tag{B4} $$
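Now we calculate the most likely value of $\alpha$ by maximizing
the likelihood with respect to $\alpha$, which is the same
as maximizing the log likelihood, since the logarithm is
a monotonic increasing function. Setting $\partial\mathcal{L}/\partial\alpha = 0$, we
find

$$ \frac{n}{\alpha-1} - \sum_{i=1}^{n} \ln\frac{x_i}{x_{\min}} = 0, \tag{B5} $$

or

$$ \alpha = 1 + n \left[ \sum_{i=1}^{n} \ln\frac{x_i}{x_{\min}} \right]^{-1}. \tag{B6} $$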
We also wish to know what the expected error is on our
value of $\alpha$. We can estimate this from the width of the
maximum of the likelihood as a function of $\alpha$. Taking the
exponential of Eq. (B4), we find that the likelihood
has the form

$$ P(x|\alpha) = a\,e^{-b\alpha} (\alpha-1)^n, $$
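where $b = \sum_{i=1}^{n} \ln(x_i/x_{\min})$ and $a$ is an unimportant
normalizing constant. Assuming that $\alpha > 1$ so that the
distribution (B1) is normalizable, the mean and mean
square of $\alpha$ in this distribution are given by

$$ \langle\alpha\rangle = \frac{\int_1^\infty e^{-b\alpha} (\alpha-1)^n \alpha \,d\alpha}{\int_1^\infty e^{-b\alpha} (\alpha-1)^n \,d\alpha} = \frac{e^{-b}\, b^{-2-n} (n+1+b)\, \Gamma(n+1)}{e^{-b}\, b^{-1-n}\, \Gamma(n+1)} = \frac{n+1+b}{b} $$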
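and

$$ \langle\alpha^2\rangle = \frac{\int_1^\infty e^{-b\alpha} (\alpha-1)^n \alpha^2 \,d\alpha}{\int_1^\infty e^{-b\alpha} (\alpha-1)^n \,d\alpha} = \frac{e^{-b}\, b^{-3-n} (n^2+3n+b^2+2b+2nb+2)\, \Gamma(n+1)}{e^{-b}\, b^{-1-n}\, \Gamma(n+1)} = \frac{n^2+3n+b^2+2b+2nb+2}{b^2}, $$

where $\Gamma(x)$ is the $\Gamma$-function of Eq. (21). Then the variance
of $\alpha$ is

$$ \sigma^2 = \langle\alpha^2\rangle - \langle\alpha\rangle^2 = \frac{n^2+3n+b^2+2b+2nb+2}{b^2} - \frac{(n+1+b)^2}{b^2} = \frac{n+1}{b^2}, $$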

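and the error on $\alpha$ is

$$ \sigma = \frac{\sqrt{n+1}}{b} = \sqrt{n+1} \left[ \sum_{i=1}^{n} \ln\frac{x_i}{x_{\min}} \right]^{-1}. $$

In most cases we will have $n \gg 1$ and it is safe to approximate
$n + 1$ by $n$, giving

$$ \sigma = \frac{\alpha-1}{\sqrt{n}}, $$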
where $\alpha$ in this expression is the maximum likelihood
estimate from Eq. (B6).
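
The estimate $\alpha = 1 + n \left[ \sum_i \ln(x_i/x_{\min}) \right]^{-1}$ and its large-$n$ error $\sigma = (\alpha-1)/\sqrt{n}$ translate directly into code. The sketch below applies them to synthetic data drawn from a known power law by the transformation method; the function name, the test exponent and the sample size are choices made for illustration, and $x_{\min}$ is taken as given rather than estimated.

```python
import numpy as np

def fit_power_law(x, x_min):
    """Return (alpha_hat, sigma): ML exponent and its approximate error."""
    x = np.asarray(x, dtype=float)
    x = x[x >= x_min]                        # only the tail above x_min is fitted
    n = x.size
    b = np.log(x / x_min).sum()
    alpha_hat = 1.0 + n / b
    sigma = (alpha_hat - 1.0) / np.sqrt(n)   # large-n form of the error
    return alpha_hat, sigma

rng = np.random.default_rng(5)
alpha_true, x_min = 2.5, 1.0
u = rng.random(100_000)
# Transformation method: x = x_min * (1 - u)^(-1/(alpha - 1)), u uniform on [0, 1).
samples = x_min * (1.0 - u) ** (-1.0 / (alpha_true - 1.0))
alpha_hat, sigma = fit_power_law(samples, x_min)
print(f"alpha = {alpha_hat:.3f} +/- {sigma:.3f}")
```

With $10^5$ samples the recovered exponent should lie within a few times $\sigma \simeq 0.005$ of the true value of 2.5.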